Forecasting Sales of Favorita Grocery Stores
Author: Moritz Grimm
Version: 20 January 2026
Executive Summary
This project explores forecasting retail sales for the Ecuadorian grocery chain Favorita. Both classical machine learning approaches (a simple and a custom recursive LightGBM) and transformer-based models (Temporal Fusion Transformer) are employed. The pipeline covers careful data preprocessing and feature engineering, shape-based clustering of the 1,782 time series, feature selection, model comparison, hyperparameter optimization with Optuna, and error analysis. Special attention is paid to leakage-aware cross-validation and modeling, as well as reproducibility: all data processing steps and pipeline stages are fully documented and fully automated, allowing the entire process to be rerun consistently.
import json
from IPython.display import Markdown
import re
def generate_toc_from_ipynb(filepath: str, max_level: int = 2, note: str = "") -> Markdown:
    """Generate a Markdown-formatted table of contents from a Jupyter notebook.

    Scans all markdown cells and includes headings up to the specified max_level depth.
    """
    with open(filepath, "r", encoding="utf-8") as f:
        notebook = json.load(f)
    toc_lines = ["## Table of Contents", " - [Executive Summary](#Executive-Summary)"]
    for cell in notebook["cells"]:
        if cell["cell_type"] == "markdown":
            for line in cell["source"]:
                # consider all headings from level ## on, except the Executive Summary itself
                if line.strip().startswith("##") and not line.strip().startswith("## Executive Summary"):
                    heading = line.strip()
                    level = heading.split(" ")[0]  # e.g. "##" or "###"
                    indent_factor = len(level) - 1
                    if indent_factor in range(1, max_level + 1):
                        # strip the leading hashes and any inline HTML tags from the title
                        title = re.sub(f"{level}|<.*>", "", heading).strip()
                        anchor = title.replace(" ", "-") + "-↑"
                        indent = " " * indent_factor
                        toc_lines.append(f"{indent}- [{title}](#{anchor})")
    toc_lines.append(f"\n{note}")
    return Markdown("\n".join(toc_lines))
# use the function to create the ToC
note = """This table of contents is generated automatically from notebook headers."""
generate_toc_from_ipynb("favoritas_nb.ipynb", max_level=3, note=note)
Table of Contents
- Executive Summary
- Key Results Summary
- Introduction
- Compiling the Dataset
- Explorative Data Analysis
- Model Training on Sample Series
- Model Training using All Series
- What else could be done
- The Dataset from a Sociological and Methodological Perspective
- Why Time Series Forecasting Is Fascinating
- Conclusions
This table of contents is generated automatically from notebook headers.
Key Results Summary ↑
- Strong differences in scale across series and the presence of high outliers required careful target scaling, calendar feature extraction, and shape-based clustering of the time series.
- Classical models (a simple LightGBM model with covariates and a custom recursive LightGBM model with lagged features) outperformed a Temporal Fusion Transformer on the evaluation folds.
- While both classical models struggled with the regime shift between years, the recursive model fell far short of the baseline in this period, despite performing slightly better than the simple model on some other cross-validation folds.
- The simple LightGBM model was selected for hyperparameter optimization using Optuna with Bayesian optimization, as it showed the most reliable performance across folds.
- Error analysis revealed that forecast accuracy still varies across series and that product families differ substantially in their degree of predictability.
- Evaluation on the test set yielded performance very similar to that of the last cross-validation fold, with only mild and expected overfitting.
- The entire pipeline was designed to be reproducible and transparent.
Introduction ↑
This project uses the Favorita dataset with sales aggregated to product-family level, which, despite its simplified label, offers rich time series data across multiple stores and product families of the Ecuadorian grocery chain Favorita. With over three million rows and contextual features like promotions, holidays, and external shocks, it provides an ideal environment for applying advanced forecasting methods. Forecasting sales is a relevant business concern: it helps maximize sales by optimizing valuable storage space and thus lowers the risk of items selling out.
I demonstrate techniques such as locale-aware holiday calendar creation, clustering the time series by their shape, informed sampling strategies, time-aware cross-validation, custom feature engineering and recursive model creation, fitting a Temporal Fusion Transformer, and model selection. The goal is to treat this dataset not as a toy problem, but as a realistic forecasting challenge—much like those found in actual retail or operations use cases.
In this project, I focus exclusively on features that are known at the time of prediction. This includes static store and product information as well as calendar-based features such as day-of-week, holidays, and seasonality patterns. While additional signals like transactions and oil prices are available for the training window and can be informative, they are not available for the forecast horizon and would require separate modeling or estimation. To maintain a clean and realistic forecasting setup, I deliberately exclude them from the feature set.
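The calendar features mentioned above are deterministic for any future date, which is what makes them safe to use over the forecast horizon. A minimal sketch (the function name and the exact feature set are illustrative, not this notebook's final feature pipeline):

```python
import pandas as pd

def add_calendar_features(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Attach calendar features that are known for any future date.

    Illustrative helper: the feature set here is a minimal example.
    """
    out = df.copy()
    d = out[date_col]
    out["dayofweek"] = d.dt.dayofweek                       # 0 = Monday
    out["month"] = d.dt.month
    out["dayofmonth"] = d.dt.day
    out["weekofyear"] = d.dt.isocalendar().week.astype(int)
    out["is_weekend"] = (out["dayofweek"] >= 5).astype(int)
    return out

demo = add_calendar_features(pd.DataFrame({"date": pd.date_range("2017-08-01", periods=7)}))
```

Because these columns are pure functions of the date, the same transformation can be applied identically to the training window and to the 14-day forecast horizon.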
First, we need to load the data, which is split across multiple files.
We will start by reading the file that contains the labels we aim to forecast — specifically, the sales we will later predict 14 days into the future. While doing so, we also convert the date column directly into a proper pandas datetime format.
import pandas as pd
import numpy as np
sales = (pd.read_csv("train.csv")
         .assign(date=lambda d: pd.to_datetime(d.date),
                 store_nbr=lambda d: d.store_nbr.astype("str")))  # cast as str for the same sorting as in older notebook versions
sales
| id | date | store_nbr | family | sales | onpromotion | |
|---|---|---|---|---|---|---|
| 0 | 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.000 | 0 |
| 1 | 1 | 2013-01-01 | 1 | BABY CARE | 0.000 | 0 |
| 2 | 2 | 2013-01-01 | 1 | BEAUTY | 0.000 | 0 |
| 3 | 3 | 2013-01-01 | 1 | BEVERAGES | 0.000 | 0 |
| 4 | 4 | 2013-01-01 | 1 | BOOKS | 0.000 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 3000883 | 3000883 | 2017-08-15 | 9 | POULTRY | 438.133 | 0 |
| 3000884 | 3000884 | 2017-08-15 | 9 | PREPARED FOODS | 154.553 | 1 |
| 3000885 | 3000885 | 2017-08-15 | 9 | PRODUCE | 2419.729 | 148 |
| 3000886 | 3000886 | 2017-08-15 | 9 | SCHOOL AND OFFICE SUPPLIES | 121.000 | 8 |
| 3000887 | 3000887 | 2017-08-15 | 9 | SEAFOOD | 16.000 | 0 |
3000888 rows × 6 columns
The data frame consists of over three million rows and six columns. These columns describe how many articles were sold on a given day (date), in which store (store_nbr), and in which product family (family). The number of articles sold is recorded in the sales column. The onpromotion column gives the total number of articles in a product family that were being promoted, and the id column provides a unique identifier for each entry in the table.
We can see that there are no missing values:
sales.isna().sum()
id 0 date 0 store_nbr 0 family 0 sales 0 onpromotion 0 dtype: int64
From this first data frame, we can further extract that there are 54 different stores and 33 different product families, resulting in 54 × 33 = 1,782 time series that we will need to forecast simultaneously. Each of these time series has a length of 1,684 days:
stores = np.sort(sales["store_nbr"].unique())
families = np.sort(sales["family"].unique())
num_stores = stores.shape[0]
num_families = families.shape[0]
num_series = num_stores*num_families
series_length = sales.shape[0]//num_series
print(f"Number of stores: {num_stores}\n"
f"Number of product families: {num_families}\n"
f"Number of time series: {num_series}\n"
f"Length of each time series: {series_length}")
Number of stores: 54 Number of product families: 33 Number of time series: 1782 Length of each time series: 1684
The fact that the number of rows divided by the number of time series returns a whole number suggests that each time series covers the same interval of time and is complete. However, let us verify this systematically to be sure:
# full set of expected dates
all_dates = set(sales["date"].unique())
# check for each time series (combination of store_nbr and family) whether its dates equal all_dates
completeness_df = (sales.groupby(["store_nbr", "family"], observed=True)
                        .apply(lambda df: set(df["date"]) == all_dates, include_groups=False)
                        .reset_index(name="is_complete"))
completeness_df
| store_nbr | family | is_complete | |
|---|---|---|---|
| 0 | 1 | AUTOMOTIVE | True |
| 1 | 1 | BABY CARE | True |
| 2 | 1 | BEAUTY | True |
| 3 | 1 | BEVERAGES | True |
| 4 | 1 | BOOKS | True |
| ... | ... | ... | ... |
| 1777 | 9 | POULTRY | True |
| 1778 | 9 | PREPARED FOODS | True |
| 1779 | 9 | PRODUCE | True |
| 1780 | 9 | SCHOOL AND OFFICE SUPPLIES | True |
| 1781 | 9 | SEAFOOD | True |
1782 rows × 3 columns
This looks good so far, now let us check if is_complete equals True for every time series:
completeness_df["is_complete"].value_counts()
is_complete True 1782 Name: count, dtype: int64
All values are True, meaning that each time series contains exactly the same dates. However, it is still possible that there are gaps within the time series — missing dates that could distort seasonality patterns, especially weekly seasonality. Therefore, we now check whether all dates in the dataset are consecutive:
# all dates from the first to the last day
expected_range = pd.date_range(sales.date.min(), sales.date.max())
# all dates that appear in the data frame
actual_dates = pd.Series(sorted(sales["date"].unique()))
# check if there are dates that do not appear in the data frame
gap_dates = expected_range[~expected_range.isin(actual_dates)]
gap_dates
DatetimeIndex(['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-25'], dtype='datetime64[ns]', freq=None)
We can see that the Christmas holiday never appears in the data, because the stores are closed on those days. We can address this easily by adding the missing dates and assigning a sales value of zero for each time series:
# multi-index with all expected dates, stores and families
multi_idx = pd.MultiIndex.from_product(
    [expected_range, stores, families],
    # expected_range is needed because 25 December is missing from the date column every year
    names=["date", "store_nbr", "family"],
)
# reindex against the full multi-index to add the missing 25 December rows,
# then order the data frame as stacked time series
sales = (sales.set_index(["date", "store_nbr", "family"])
              .reindex(multi_idx)
              .reset_index(["store_nbr", "family", "date"])
              .sort_values(["store_nbr", "family", "date"]))
# the new rows contain NaNs: fill sales and onpromotion with 0
sales[["sales", "onpromotion"]] = sales[["sales", "onpromotion"]].fillna(0.)
sales.id = sales.id.interpolate(method="linear")  # fill the id column by linear interpolation
sales.loc[sales.date.isin(gap_dates)]
| date | store_nbr | family | id | sales | onpromotion | |
|---|---|---|---|---|---|---|
| 637956 | 2013-12-25 | 1 | AUTOMOTIVE | 637065.0 | 0.0 | 0.0 |
| 1288386 | 2014-12-25 | 1 | AUTOMOTIVE | 1285713.0 | 0.0 | 0.0 |
| 1938816 | 2015-12-25 | 1 | AUTOMOTIVE | 1934361.0 | 0.0 | 0.0 |
| 2591028 | 2016-12-25 | 1 | AUTOMOTIVE | 2584791.0 | 0.0 | 0.0 |
| 637957 | 2013-12-25 | 1 | BABY CARE | 637066.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2592808 | 2016-12-25 | 9 | SCHOOL AND OFFICE SUPPLIES | 2586571.0 | 0.0 | 0.0 |
| 639737 | 2013-12-25 | 9 | SEAFOOD | 638846.0 | 0.0 | 0.0 |
| 1290167 | 2014-12-25 | 9 | SEAFOOD | 1287494.0 | 0.0 | 0.0 |
| 1940597 | 2015-12-25 | 9 | SEAFOOD | 1936142.0 | 0.0 | 0.0 |
| 2592809 | 2016-12-25 | 9 | SEAFOOD | 2586572.0 | 0.0 | 0.0 |
7128 rows × 6 columns
sales.shape[0]
3008016
Now that we have complete time series without any gaps, we can compute a unique identifier for each time series by combining the store number and the product family, which makes handling them — such as iterating over them — much easier.
sales["series_id"] = sales["store_nbr"].astype(str).str.cat(sales["family"], sep="_")
sales["store_nbr"] = sales["store_nbr"].astype(int)  # easier to handle in joins, plots, etc.
sales[["series_id", "store_nbr", "family"]]
| series_id | store_nbr | family | |
|---|---|---|---|
| 0 | 1_AUTOMOTIVE | 1 | AUTOMOTIVE |
| 1782 | 1_AUTOMOTIVE | 1 | AUTOMOTIVE |
| 3564 | 1_AUTOMOTIVE | 1 | AUTOMOTIVE |
| 5346 | 1_AUTOMOTIVE | 1 | AUTOMOTIVE |
| 7128 | 1_AUTOMOTIVE | 1 | AUTOMOTIVE |
| ... | ... | ... | ... |
| 3000887 | 9_SEAFOOD | 9 | SEAFOOD |
| 3002669 | 9_SEAFOOD | 9 | SEAFOOD |
| 3004451 | 9_SEAFOOD | 9 | SEAFOOD |
| 3006233 | 9_SEAFOOD | 9 | SEAFOOD |
| 3008015 | 9_SEAFOOD | 9 | SEAFOOD |
3008016 rows × 3 columns
Now that we have fixed this and the data frame has grown to 3,008,016 rows, we will take a closer look at the details of each store, which are stored in another file. All of its columns contain categorical data:
stores_df = pd.read_csv("stores.csv")
stores_df.head(5)
| store_nbr | city | state | type | cluster | |
|---|---|---|---|---|---|
| 0 | 1 | Quito | Pichincha | D | 13 |
| 1 | 2 | Quito | Pichincha | D | 13 |
| 2 | 3 | Quito | Pichincha | D | 8 |
| 3 | 4 | Quito | Pichincha | D | 9 |
| 4 | 5 | Santo Domingo | Santo Domingo de los Tsachilas | D | 4 |
As the column names suggest, we find information about the city and the state where each store is located, as well as the type and cluster (a grouping of similar stores) each store belongs to. The type column is not officially explained by Favorita, so we will have to infer what it may represent.
The stores are located in 22 cities and 16 states, and are classified into five types and 17 clusters:
for col in ["city", "state", "type", "cluster"]:
print(f"Number of unique values in column `{col}`: {stores_df[col].nunique()}")
Number of unique values in column `city`: 22 Number of unique values in column `state`: 16 Number of unique values in column `type`: 5 Number of unique values in column `cluster`: 17
We now merge the sales dataframe with the stores dataframe:
sales__after_stores = pd.merge(sales, stores_df, how="left", on="store_nbr")
sales__after_stores
| date | store_nbr | family | id | sales | onpromotion | series_id | city | state | type | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | 13 |
| 1 | 2013-01-02 | 1 | AUTOMOTIVE | 1782.0 | 2.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | 13 |
| 2 | 2013-01-03 | 1 | AUTOMOTIVE | 3564.0 | 3.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | 13 |
| 3 | 2013-01-04 | 1 | AUTOMOTIVE | 5346.0 | 3.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | 13 |
| 4 | 2013-01-05 | 1 | AUTOMOTIVE | 7128.0 | 5.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | 13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3008011 | 2017-08-11 | 9 | SEAFOOD | 2993759.0 | 23.831000 | 0.0 | 9_SEAFOOD | Quito | Pichincha | B | 6 |
| 3008012 | 2017-08-12 | 9 | SEAFOOD | 2995541.0 | 16.859001 | 4.0 | 9_SEAFOOD | Quito | Pichincha | B | 6 |
| 3008013 | 2017-08-13 | 9 | SEAFOOD | 2997323.0 | 20.000000 | 0.0 | 9_SEAFOOD | Quito | Pichincha | B | 6 |
| 3008014 | 2017-08-14 | 9 | SEAFOOD | 2999105.0 | 17.000000 | 0.0 | 9_SEAFOOD | Quito | Pichincha | B | 6 |
| 3008015 | 2017-08-15 | 9 | SEAFOOD | 3000887.0 | 16.000000 | 0.0 | 9_SEAFOOD | Quito | Pichincha | B | 6 |
3008016 rows × 11 columns
holidays = pd.read_csv("holidays_events.csv").assign(date=lambda d: pd.to_datetime(d.date))
holidays
| date | type | locale | locale_name | description | transferred | |
|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Holiday | Local | Manta | Fundacion de Manta | False |
| 1 | 2012-04-01 | Holiday | Regional | Cotopaxi | Provincializacion de Cotopaxi | False |
| 2 | 2012-04-12 | Holiday | Local | Cuenca | Fundacion de Cuenca | False |
| 3 | 2012-04-14 | Holiday | Local | Libertad | Cantonizacion de Libertad | False |
| 4 | 2012-04-21 | Holiday | Local | Riobamba | Cantonizacion de Riobamba | False |
| ... | ... | ... | ... | ... | ... | ... |
| 345 | 2017-12-22 | Additional | National | Ecuador | Navidad-3 | False |
| 346 | 2017-12-23 | Additional | National | Ecuador | Navidad-2 | False |
| 347 | 2017-12-24 | Additional | National | Ecuador | Navidad-1 | False |
| 348 | 2017-12-25 | Holiday | National | Ecuador | Navidad | False |
| 349 | 2017-12-26 | Additional | National | Ecuador | Navidad+1 | False |
350 rows × 6 columns
Let us briefly summarize this table to improve understanding:
for col in ["type", "locale", "locale_name", "description"]:
msg = f"Number of unique values of column `{col}`: {holidays[col].nunique()}"
if col in ["type", "locale"]:
msg += f", with unique values: {holidays[col].unique()}"
print(msg)
Number of unique values of column `type`: 6, with unique values: ['Holiday' 'Transfer' 'Additional' 'Bridge' 'Work Day' 'Event'] Number of unique values of column `locale`: 3, with unique values: ['Local' 'Regional' 'National'] Number of unique values of column `locale_name`: 24 Number of unique values of column `description`: 103
The holidays DataFrame is quite complex, so let me explain its structure and the pitfalls that can easily be overlooked.
First, the name holidays is slightly misleading, as the table also includes special events and other exceptions that are not days off from work.
From the table and its summary, we learn that there are 350 unique dates that are somehow special and need to be treated differently. However, some of these precede the sales time window. When we reduce the holidays table to only include dates that fall within our time series, we are left with 309 special dates to consider.
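Restricting the holidays table to the sales window is a simple range filter. A sketch on a tiny toy frame (three real rows from the table stand in for all 350; in the notebook the bounds would come from sales.date.min() / sales.date.max(), hard-coded here to the dataset's known range):

```python
import pandas as pd

# Toy stand-in for the holidays table (three of its real rows)
holidays_demo = pd.DataFrame({
    "date": pd.to_datetime(["2012-03-02", "2013-05-12", "2017-12-25"]),
    "description": ["Fundacion de Manta", "Dia de la Madre", "Navidad"],
})
# Bounds of the sales window; in the notebook these come from the sales frame itself
window_start, window_end = pd.Timestamp("2013-01-01"), pd.Timestamp("2017-08-15")
in_window = holidays_demo[holidays_demo["date"].between(window_start, window_end)]
```

Applied to the full table, this filter is what reduces the 350 special dates to the 309 that fall within our time series.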
Holiday Types
There are six different type values in the table:
- `Holiday`: These are standard public holidays. However, some of them were transferred to another date. If so, the original `Holiday` row will have `transferred == True`, and the day is treated like a normal workday.
- `Transfer`: This is the date when a transferred holiday is actually celebrated.
- `Additional`: These are extra holidays added to existing ones, typically to extend free time around events like Christmas.
- `Bridge`: These are Mondays or Fridays added to create long weekends when a holiday falls near a weekend.
- `Work Day`: To compensate for extended holidays, certain Saturdays were declared to be regular workdays.
- `Event`: These are events, such as:
  - national or international occurrences like the World Cup (even though it was held in Brazil),
  - natural disasters like the earthquake in Manabí,
  - or recurring commercial events like Mother’s Day, Black Friday, and Cyber Monday.
Locale Levels
The locale column differentiates between three levels:
- `Local`: The event applies only to a specific city, given in `locale_name`, which always matches a city in `stores_df`.
- `Regional`: The event applies to a specific state, also matched via `locale_name` and `stores_df`.
- `National`: The event applies to the whole country of Ecuador.
These different scope levels mean that a simple join with the sales__after_stores dataframe is not possible. Instead:
- `National` holidays must be joined on date only
- `Regional` holidays require a join on date and state
- `Local` holidays require a join on date and city
Additionally:
- Only the `Holiday` and `Transfer` types require consideration of the `transferred` column.
- All other types (`Additional`, `Bridge`, `Event`, `Work Day`) are not transferable.
- The `Event` and `Work Day` types always apply nationwide.
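To make the three join keys concrete, here is a toy sketch of the locale-aware merges (the frame contents are invented placeholders; the actual joins in this notebook operate on the full, deduplicated holiday tables):

```python
import pandas as pd

# Tiny toy sales frame: two stores on Mother's Day 2014
sales_demo = pd.DataFrame({
    "date": pd.to_datetime(["2014-05-11", "2014-05-11"]),
    "city": ["Quito", "Guayaquil"],
    "state": ["Pichincha", "Guayas"],
})
# Placeholder holiday tables, one per locale level
national = pd.DataFrame({"date": pd.to_datetime(["2014-05-11"]),
                         "national_description": ["Dia de la Madre"]})
regional = pd.DataFrame({"date": pd.to_datetime(["2014-05-11"]),
                         "state": ["Guayas"],
                         "regional_description": ["Some regional day"]})
local = pd.DataFrame({"date": pd.to_datetime(["2014-05-11"]),
                      "city": ["Quito"],
                      "local_description": ["Some local day"]})

merged = (sales_demo
          .merge(national, how="left", on="date")             # national: date only
          .merge(regional, how="left", on=["date", "state"])  # regional: date + state
          .merge(local, how="left", on=["date", "city"]))     # local: date + city
```

The national event reaches both rows, while the regional and local placeholders only attach to the matching state and city respectively; everything else stays NaN.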
Handling Overlaps and Deduplication ↑
A further complication is that some dates include multiple entries, occasionally even for the same locale level. While most overlaps occur between different locale levels (e.g., a local and a national event on the same day), in rare cases, two holidays exist on the same date within the same locale.
Handling these overlaps explicitly allows us to maintain a rich, informative feature set while avoiding duplicate joins and lookup ambiguities.
But first, we need to sort the DataFrame, add a city column and, by mapping over that column, a state column, which makes some comparisons easier.
# sort the table
holidays = holidays.sort_values(["date", "locale_name"])
# create city column
holidays["city"] = [city if locale == "Local" else np.nan
for city, locale in holidays[["locale_name", "locale"]].values]
# create state column, starting with the states derived from the city
city_states = stores_df[["city", "state"]].drop_duplicates().set_index("city")["state"]
holidays["state"] = holidays["city"].map(city_states)
# adding the states, where the locale level is not a city but a state (locale=="Regional") itself:
holidays.loc[holidays.locale == "Regional", "state"] = holidays.loc[holidays.locale == "Regional", "locale_name"]
holidays = holidays.fillna("None")
holidays
| date | type | locale | locale_name | description | transferred | city | state | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Holiday | Local | Manta | Fundacion de Manta | False | Manta | Manabi |
| 1 | 2012-04-01 | Holiday | Regional | Cotopaxi | Provincializacion de Cotopaxi | False | None | Cotopaxi |
| 2 | 2012-04-12 | Holiday | Local | Cuenca | Fundacion de Cuenca | False | Cuenca | Azuay |
| 3 | 2012-04-14 | Holiday | Local | Libertad | Cantonizacion de Libertad | False | Libertad | Guayas |
| 4 | 2012-04-21 | Holiday | Local | Riobamba | Cantonizacion de Riobamba | False | Riobamba | Chimborazo |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | 2017-12-22 | Holiday | Local | Salinas | Cantonizacion de Salinas | False | Salinas | Santa Elena |
| 346 | 2017-12-23 | Additional | National | Ecuador | Navidad-2 | False | None | None |
| 347 | 2017-12-24 | Additional | National | Ecuador | Navidad-1 | False | None | None |
| 348 | 2017-12-25 | Holiday | National | Ecuador | Navidad | False | None | None |
| 349 | 2017-12-26 | Additional | National | Ecuador | Navidad+1 | False | None | None |
350 rows × 8 columns
We find that there are no regional holidays within a given state that collide with local holidays in cities of the same state:
holidays_locale_overlap_days = holidays[(holidays.date == holidays.date.shift(1)) &
(holidays.locale != holidays.locale.shift(1)) &
(holidays.state == holidays.state.shift(1))].date
holidays[holidays.date.isin(holidays_locale_overlap_days)]
| date | type | locale | locale_name | description | transferred | city | state |
|---|
This empty frame means that, if modeling local and regional holidays separately proves too noisy, both could be aggregated into a single subnational holiday feature, as opposed to the national holidays. For now, however, I model all three levels separately. We start with the local level and look at the overlaps:
# With that knowledge, now we can merge the frames separately for local, regional and national holidays
local_holidays = holidays[holidays.locale=="Local"]
local_overlap = local_holidays[(local_holidays.date==local_holidays.date.shift(1)) &
(local_holidays.locale_name==local_holidays.locale_name.shift(1))].date
local_holidays[(local_holidays.date.isin(local_overlap))]
| date | type | locale | locale_name | description | transferred | city | state | |
|---|---|---|---|---|---|---|---|---|
| 264 | 2016-07-24 | Additional | Local | Guayaquil | Fundacion de Guayaquil-1 | False | Guayaquil | Guayas |
| 265 | 2016-07-24 | Transfer | Local | Guayaquil | Traslado Fundacion de Guayaquil | False | Guayaquil | Guayas |
On the regional level, we do not find any overlaps at all:
regional_holidays = holidays[holidays.locale=="Regional"]
regional_overlap = regional_holidays[regional_holidays.date==regional_holidays.date.shift(1)].date
regional_holidays[(regional_holidays.date.isin(regional_overlap))]
| date | type | locale | locale_name | description | transferred | city | state |
|---|
But on the national level, we do (excluding the types Event and Work Day, since they need to be treated separately):
national_holidays = holidays[(holidays.locale=="National") & (~holidays.type.isin(["Event", "Work Day"]))]
national_overlap = national_holidays[national_holidays.date==national_holidays.date.shift(1)].date
national_holidays[(national_holidays.date.isin(national_overlap))]
| date | type | locale | locale_name | description | transferred | city | state | |
|---|---|---|---|---|---|---|---|---|
| 35 | 2012-12-24 | Bridge | National | Ecuador | Puente Navidad | False | None | None |
| 36 | 2012-12-24 | Additional | National | Ecuador | Navidad-1 | False | None | None |
| 39 | 2012-12-31 | Bridge | National | Ecuador | Puente Primer dia del ano | False | None | None |
| 40 | 2012-12-31 | Additional | National | Ecuador | Primer dia del ano-1 | False | None | None |
| 156 | 2014-12-26 | Bridge | National | Ecuador | Puente Navidad | False | None | None |
| 157 | 2014-12-26 | Additional | National | Ecuador | Navidad+1 | False | None | None |
This duplication of dates is problematic both for a merge-based approach (which would lead to row duplication) and for a lookup-based approach (which requires unique keys).
To address this, we perform a deduplication step, prioritizing the type values (Transfer > Holiday > Bridge > Additional) to ensure each (date, city/state/country) combination only occurs once in the final merged dataset.
This prioritization is valid because overlaps of different type values within the same locale always refer to the same underlying holiday group. For example, on 2016-07-24, the local holiday celebrating the founding of Guayaquil was transferred to that date, which had originally been marked as an Additional holiday to precede the same celebration. Similarly, on the national level, overlaps also occur only between Bridge and Additional types, and always near the same holiday.
From this, I inferred that Bridge holidays are treated more like full days off than Additional ones and therefore deserve priority in the deduplication logic.
Since holidays of the type Holiday can also refer to dates on which a holiday was originally scheduled but later transferred, this type comes after Transfer, which always indicates a fully observed day off.
Moreover, bridging or adding a holiday on a date where a holiday was originally scheduled does not make sense—since in that case, the holiday could have simply remained on that date.
The Event type, however, must be treated differently: we want to retain all such entries, since multiple events can happen on the same day, and their combined effects (e.g., Mother’s Day and the earthquake) should be modeled explicitly.
Since the effect of a nationwide compensating Work Day is essentially the opposite of a holiday, and because it may interact with other holidays at the regional or local level, we treat these cases separately as well.
We do not include the types Event and Work Day in this order, as we will treat them separately.
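As an aside, one possible way to realize the "retain all same-day events" idea (an assumption for illustration, not necessarily the approach taken later in this notebook) is to aggregate the event descriptions into a single row per date, which keeps the calendar keys unique while losing no event:

```python
import pandas as pd

# Toy events frame; the overlap of Mother's Day with the earthquake-related
# events in 2016 is the motivating case, but these rows are illustrative.
events_demo = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-08", "2016-05-08", "2014-11-28"]),
    "description": ["Dia de la Madre", "Terremoto Manabi+22", "Black Friday"],
})
# One row per date, with all same-day event descriptions joined together
events_per_date = (events_demo.groupby("date")["description"]
                              .agg(lambda s: " | ".join(sorted(s)))
                              .reset_index(name="event_descriptions"))
```

The joined string (or, alternatively, one indicator column per event kind) can then be merged on date alone without duplicating sales rows.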
from pandas.api.types import CategoricalDtype
priority_order = ["Transfer", "Holiday", "Bridge", "Additional"]
type_cat = CategoricalDtype(categories=priority_order, ordered=True)
def deduplicate_and_rename_holidays(df: pd.DataFrame, key_cols: list, rename_prefix: str) -> pd.DataFrame:
    """
    Keep only the highest-priority type when a holiday appears more than once
    for the same date and locale name. Also rename the relevant columns
    type, description and transferred with the given prefix.
    """
    df = df.copy()
    df["type"] = df["type"].astype(type_cat)
    df = df.sort_values(by=key_cols + ["type"])  # highest priority comes first
    df = df.drop_duplicates(subset=key_cols, keep="first")
    rename_cols = ["type", "description", "transferred"]
    return df.rename(columns={col: f"{rename_prefix}_{col}" for col in rename_cols})
# apply the deduplication to all three locale levels in a loop
duplicate_dfs = {"local": local_holidays,
                 "regional": regional_holidays,
                 "national": national_holidays}
deduplicated_dfs = []
for prefix, df in duplicate_dfs.items():
    deduplicated_dfs.append(deduplicate_and_rename_holidays(df, ["date", "locale_name"], prefix))
local_holidays_deduplicated, regional_holidays_deduplicated, national_holidays_deduplicated = deduplicated_dfs
# now the same date only appears once:
local_holidays_deduplicated[(local_holidays_deduplicated.date.isin(local_overlap))]
| date | local_type | locale | locale_name | local_description | local_transferred | city | state | |
|---|---|---|---|---|---|---|---|---|
| 265 | 2016-07-24 | Transfer | Local | Guayaquil | Traslado Fundacion de Guayaquil | False | Guayaquil | Guayas |
As intended by the precedence order, only the `Transfer`-type entry remains among the local overlaps. Since there are no regional overlaps, the function only renamed the relevant columns there; we therefore only need to verify that it works correctly on national_holidays_deduplicated, i.e., that rows of type Additional are no longer included in the overlaps.
national_holidays_deduplicated[national_holidays_deduplicated.date.isin(national_overlap)]
| date | national_type | locale | locale_name | national_description | national_transferred | city | state | |
|---|---|---|---|---|---|---|---|---|
| 35 | 2012-12-24 | Bridge | National | Ecuador | Puente Navidad | False | None | None |
| 39 | 2012-12-31 | Bridge | National | Ecuador | Puente Primer dia del ano | False | None | None |
| 156 | 2014-12-26 | Bridge | National | Ecuador | Puente Navidad | False | None | None |
We can now start to create a calendar with the correct combinations of local, regional, and national holidays or events.
Creating a Global Calendar Table ↑
At first, it might seem sufficient to join the deduplicated datasets with their locale level in sales__after_stores—and that was my initial approach. However, since holiday effects often extend to days before and after the event, we must calculate the distance in days to and since each holiday. This requires a calendar that extends beyond the training window (for example, our window starts on 1 January 2013, when sales are still affected by Christmas and New Year).
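Computing "days since the previous holiday" and "days until the next holiday" can be sketched with np.searchsorted. The helper name is illustrative, and it assumes the holiday calendar spans a wider window than the dates being featurized on both sides — exactly why the calendar must extend beyond the training window:

```python
import numpy as np
import pandas as pd

def holiday_distances(dates: pd.DatetimeIndex, holidays: pd.DatetimeIndex):
    """Days since the previous holiday and days until the next one.

    Assumes `holidays` spans a wider window than `dates`, so that a
    previous and a next holiday always exist. Illustrative helper.
    """
    hol = holidays.sort_values().asi8                 # holidays as int64 nanoseconds
    d = dates.asi8
    nxt = np.searchsorted(hol, d, side="left")        # index of next holiday >= date
    prv = np.searchsorted(hol, d, side="right") - 1   # index of last holiday <= date
    ns_per_day = 86_400_000_000_000
    days_until_next = (hol[nxt] - d) // ns_per_day
    days_since_prev = (d - hol[prv]) // ns_per_day
    return days_since_prev, days_until_next

# Toy calendar around the start of the training window
hols = pd.DatetimeIndex(["2012-12-25", "2013-01-01", "2013-02-11"])
since, until = holiday_distances(pd.date_range("2013-01-02", "2013-01-04"), hols)
```

On a date that is itself a holiday, both distances are zero; a date on 2 January 2013 is one day past New Year, which is the kind of lingering holiday effect these features are meant to capture.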
We then create a skeleton of unique date–locale_id combinations. Stores in cities without local holidays do not appear in the holidays table, so their calendars only need regional and national events, merged at the state level. The locale_id column handles this by adding a suffix, since some states and cities share the same name.
# create unique city-date combinations
holiday_cities = sorted([city for city in holidays.city.unique() if city!="None"])
holiday_range = pd.date_range(holidays.date.min(), holidays.date.max())
date_cities = pd.DataFrame(data = [[date, city, f"{city}_city"]
for city in holiday_cities
for date in holiday_range], # full range for each city
columns = ["date", "city", "locale_id"])
# dict to map states by cities
states_by_cities = stores_df[["city", "state"]].drop_duplicates().set_index("city").squeeze().to_dict()
date_cities["state"] = date_cities["city"].map(states_by_cities)
# create unique state-date combinations
holiday_states = sorted([state for state in holidays.state.unique() if state!="None"])
date_states = pd.DataFrame(data = [[date, state, f"{state}_state"]
for state in holiday_states
for date in holiday_range], # full range for each state
columns = ["date", "state", "locale_id"])
date_table = pd.concat([date_cities, date_states], axis=0).fillna("None")
date_table
| date | city | locale_id | state | |
|---|---|---|---|---|
| 0 | 2012-03-02 | Ambato | Ambato_city | Tungurahua |
| 1 | 2012-03-03 | Ambato | Ambato_city | Tungurahua |
| 2 | 2012-03-04 | Ambato | Ambato_city | Tungurahua |
| 3 | 2012-03-05 | Ambato | Ambato_city | Tungurahua |
| 4 | 2012-03-06 | Ambato | Ambato_city | Tungurahua |
| ... | ... | ... | ... | ... |
| 34011 | 2017-12-22 | None | Tungurahua_state | Tungurahua |
| 34012 | 2017-12-23 | None | Tungurahua_state | Tungurahua |
| 34013 | 2017-12-24 | None | Tungurahua_state | Tungurahua |
| 34014 | 2017-12-25 | None | Tungurahua_state | Tungurahua |
| 34015 | 2017-12-26 | None | Tungurahua_state | Tungurahua |
74410 rows × 4 columns
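Before moving on, it is worth verifying that every locale covers the same contiguous daily range, since later group-wise computations rely on equally long, aligned calendars. A minimal sketch with a hypothetical helper (not part of the original pipeline):

```python
import pandas as pd

def check_contiguous_calendar(df: pd.DataFrame) -> bool:
    """True if every locale_id spans the same contiguous daily date range."""
    expected = pd.date_range(df["date"].min(), df["date"].max())
    stats = df.groupby("locale_id")["date"].agg(["min", "max", "size"])
    return bool((stats["size"] == len(expected)).all()
                and (stats["min"] == expected[0]).all()
                and (stats["max"] == expected[-1]).all())
```

Applied to `date_table`, this should return `True` by construction, since each locale was built from the same full `holiday_range`.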
We now add information about holidays and events to these unique calendars. We begin with the events, as they are always defined at the national locale level. First, we take a quick look at all events:
holidays.loc[holidays.type=="Event", ["date", "type", "description"]]
| date | type | description | |
|---|---|---|---|
| 55 | 2013-05-12 | Event | Dia de la Madre |
| 103 | 2014-05-11 | Event | Dia de la Madre |
| 106 | 2014-06-12 | Event | Inauguracion Mundial de futbol Brasil |
| 107 | 2014-06-15 | Event | Mundial de futbol Brasil: Ecuador-Suiza |
| 108 | 2014-06-20 | Event | Mundial de futbol Brasil: Ecuador-Honduras |
| 113 | 2014-06-25 | Event | Mundial de futbol Brasil: Ecuador-Francia |
| 114 | 2014-06-28 | Event | Mundial de futbol Brasil: Octavos de Final |
| 115 | 2014-06-29 | Event | Mundial de futbol Brasil: Octavos de Final |
| 116 | 2014-06-30 | Event | Mundial de futbol Brasil: Octavos de Final |
| 117 | 2014-07-01 | Event | Mundial de futbol Brasil: Octavos de Final |
| 120 | 2014-07-04 | Event | Mundial de futbol Brasil: Cuartos de Final |
| 121 | 2014-07-05 | Event | Mundial de futbol Brasil: Cuartos de Final |
| 122 | 2014-07-08 | Event | Mundial de futbol Brasil: Semifinales |
| 123 | 2014-07-09 | Event | Mundial de futbol Brasil: Semifinales |
| 124 | 2014-07-12 | Event | Mundial de futbol Brasil: Tercer y cuarto lugar |
| 125 | 2014-07-13 | Event | Mundial de futbol Brasil: Final |
| 144 | 2014-11-28 | Event | Black Friday |
| 145 | 2014-12-01 | Event | Cyber Monday |
| 172 | 2015-05-10 | Event | Dia de la Madre |
| 198 | 2015-11-27 | Event | Black Friday |
| 199 | 2015-11-30 | Event | Cyber Monday |
| 219 | 2016-04-16 | Event | Terremoto Manabi |
| 220 | 2016-04-17 | Event | Terremoto Manabi+1 |
| 221 | 2016-04-18 | Event | Terremoto Manabi+2 |
| 222 | 2016-04-19 | Event | Terremoto Manabi+3 |
| 223 | 2016-04-20 | Event | Terremoto Manabi+4 |
| 225 | 2016-04-21 | Event | Terremoto Manabi+5 |
| 226 | 2016-04-22 | Event | Terremoto Manabi+6 |
| 227 | 2016-04-23 | Event | Terremoto Manabi+7 |
| 228 | 2016-04-24 | Event | Terremoto Manabi+8 |
| 229 | 2016-04-25 | Event | Terremoto Manabi+9 |
| 230 | 2016-04-26 | Event | Terremoto Manabi+10 |
| 231 | 2016-04-27 | Event | Terremoto Manabi+11 |
| 232 | 2016-04-28 | Event | Terremoto Manabi+12 |
| 233 | 2016-04-29 | Event | Terremoto Manabi+13 |
| 234 | 2016-04-30 | Event | Terremoto Manabi+14 |
| 236 | 2016-05-01 | Event | Terremoto Manabi+15 |
| 237 | 2016-05-02 | Event | Terremoto Manabi+16 |
| 238 | 2016-05-03 | Event | Terremoto Manabi+17 |
| 239 | 2016-05-04 | Event | Terremoto Manabi+18 |
| 240 | 2016-05-05 | Event | Terremoto Manabi+19 |
| 241 | 2016-05-06 | Event | Terremoto Manabi+20 |
| 243 | 2016-05-07 | Event | Terremoto Manabi+21 |
| 244 | 2016-05-08 | Event | Terremoto Manabi+22 |
| 245 | 2016-05-08 | Event | Dia de la Madre |
| 246 | 2016-05-09 | Event | Terremoto Manabi+23 |
| 247 | 2016-05-10 | Event | Terremoto Manabi+24 |
| 248 | 2016-05-11 | Event | Terremoto Manabi+25 |
| 250 | 2016-05-12 | Event | Terremoto Manabi+26 |
| 251 | 2016-05-13 | Event | Terremoto Manabi+27 |
| 252 | 2016-05-14 | Event | Terremoto Manabi+28 |
| 253 | 2016-05-15 | Event | Terremoto Manabi+29 |
| 254 | 2016-05-16 | Event | Terremoto Manabi+30 |
| 284 | 2016-11-25 | Event | Black Friday |
| 285 | 2016-11-28 | Event | Cyber Monday |
| 311 | 2017-05-14 | Event | Dia de la Madre |
We find recurring events such as Mother's Day, Black Friday, and Cyber Monday, as well as one-time events such as the Soccer World Cup matches in Brazil in 2014 and the Earthquake in Manabí and its aftermath in April and May 2016.
We create boolean flags for each event and for cases where a Saturday was redeclared a workday. For the Soccer World Cup, we also add a flag indicating whether Ecuador played and a column that encodes the stage of each listed match (from 0 for the inauguration to 6 for the final). For the Earthquake, the flag only marks the day it occurred, as we will later test how long its impact on sales lasted.
event_descriptions = holidays.loc[holidays.type=="Event", "description"].unique()
unique_events = set( # reduce to unique elements ("Mundial" and "Terremoto" appear in similar descriptions)
["Mundial" if "Mundial" in event else # for the soccer world cup in Brasil
"Terremoto Manabi" if "Terremoto" in event else # for the earthquake in Manabi
event for event in event_descriptions]
)
# create a dictionary with corresponding dates for each event
event_dates = {event:
holidays.loc[holidays.description.str.contains(event), "date"].values if event == "Mundial" else
holidays.loc[holidays.description==event, "date"].values # exact match for exclusion of Terremoto Manabi+1...
for event in sorted(unique_events)}
for event, dates in event_dates.items():
date_table[f"is_{event}"] = date_table["date"].isin(dates)
# assign an own variable for matches with Ecuador involved
mundial_ecuador_descriptions = list(filter(lambda x: "Ecuador" in x, event_descriptions))
mundial_ecuador_dates = holidays.loc[holidays.description.isin(mundial_ecuador_descriptions), "date"]
date_table["is_mundial_ecuador"] = date_table.date.isin(mundial_ecuador_dates)
# add a column with increasing numbers for each stage in the world cup
mundial_df = holidays.loc[holidays["description"].str.contains("Mundial"), ["date", "description"]]
mundial_df["stage"] = [
0 if "Inauguracion" in desc else
1 if "Ecuador" in desc else # only group games with Ecuador are listed
2 if "Octavos" in desc else # round of last 16
3 if "Cuartos" in desc else # quarter-finals
4 if "Semifinales" in desc else # semi-final
5 if "Tercer y cuarto lugar" in desc else # match for 3rd and 4th places
6 # final
for desc in mundial_df["description"]
]
mundial_stage_map = mundial_df[["date", "stage"]].set_index("date").squeeze().to_dict()
date_table["mundial_stage"] = date_table["date"].map(mundial_stage_map).fillna(-1).astype(int) # -1 for no match days
# flag Saturdays that were declared workdays as compensation
workday_dates = holidays.loc[holidays.type=="Work Day", "date"]
date_table["is_workday"] = date_table.date.isin(workday_dates)
# replace spaces with an underscore for consistent naming
date_table.columns = date_table.columns.str.replace(" ", "_").str.lower()
date_table.head()
| date | city | locale_id | state | is_black_friday | is_cyber_monday | is_dia_de_la_madre | is_mundial | is_terremoto_manabi | is_mundial_ecuador | mundial_stage | is_workday | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Ambato | Ambato_city | Tungurahua | False | False | False | False | False | False | -1 | False |
| 1 | 2012-03-03 | Ambato | Ambato_city | Tungurahua | False | False | False | False | False | False | -1 | False |
| 2 | 2012-03-04 | Ambato | Ambato_city | Tungurahua | False | False | False | False | False | False | -1 | False |
| 3 | 2012-03-05 | Ambato | Ambato_city | Tungurahua | False | False | False | False | False | False | -1 | False |
| 4 | 2012-03-06 | Ambato | Ambato_city | Tungurahua | False | False | False | False | False | False | -1 | False |
We now merge the deduplicated holiday information for each city and state with the date_table, resulting in columns for the holiday type, holiday description, and transferred flag at each locale level:
holidays_dfs = {"local": (local_holidays_deduplicated, ["date", "city"]),
"regional": (regional_holidays_deduplicated, ["date", "state"]),
"national": (national_holidays_deduplicated, ["date"])}
def merge_holidays(df: pd.DataFrame, merge_dfs: dict[str, tuple[pd.DataFrame, list[str]]]) -> pd.DataFrame:
for key, value in merge_dfs.items():
# keep only merge cols and those with relevant info for this locale (type, description, transferred)
cols = value[1] + [col for col in value[0] if col.startswith(f"{key}_")]
df = df.merge(value[0][cols], on=value[1], how="left")
return df
date_table = merge_holidays(df=date_table, merge_dfs=holidays_dfs)
# example
date_table.loc[
date_table.date.between("2013-12-25", "2014-01-01"),
["date", "locale_id"] + [
col for col in date_table.columns
if any(key in col for key in ("type", "description", "transferred"))
]] # df.filter(regex="date|locale_id|type...) would have worked too, but this is library-agnostic
| date | locale_id | local_type | local_description | local_transferred | regional_type | regional_description | regional_transferred | national_type | national_description | national_transferred | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 663 | 2013-12-25 | Ambato_city | NaN | NaN | NaN | NaN | NaN | NaN | Holiday | Navidad | False |
| 664 | 2013-12-26 | Ambato_city | NaN | NaN | NaN | NaN | NaN | NaN | Additional | Navidad+1 | False |
| 665 | 2013-12-27 | Ambato_city | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 666 | 2013-12-28 | Ambato_city | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 667 | 2013-12-29 | Ambato_city | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72950 | 2013-12-28 | Tungurahua_state | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 72951 | 2013-12-29 | Tungurahua_state | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 72952 | 2013-12-30 | Tungurahua_state | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 72953 | 2013-12-31 | Tungurahua_state | NaN | NaN | NaN | NaN | NaN | NaN | Additional | Primer dia del ano-1 | False |
| 72954 | 2014-01-01 | Tungurahua_state | NaN | NaN | NaN | NaN | NaN | NaN | Holiday | Primer dia del ano | False |
280 rows × 11 columns
Now we can combine the information from the holiday type and the transferred columns into a single column for each locale level. Days marked as Holiday where transferred = True receive the label Transferred Holiday, while unlabeled days are assigned the type No Holiday. This makes it straightforward to create a flag is_any_holiday which indicates whether a date is a holiday on any locale level.
locale_levels = ["local", "regional", "national"]
# new priority order
priority_order = ["Holiday", "Transfer", "Bridge", "Additional", "Transferred Holiday", "No Holiday"]
type_cat = CategoricalDtype(categories=priority_order, ordered=True)
# condense info from type cols and transferred cols into one col respectively
for level in locale_levels:
date_table[f"{level}_type"] = pd.Series(
np.where(date_table[f"{level}_transferred"]==True,
"Transferred Holiday",
date_table[f"{level}_type"])
).fillna("No Holiday").astype(type_cat)
date_table[f"is_{level}_holiday"] = np.where(date_table[f"{level}_type"].isin(["Holiday", "Transfer"]),
True,
False)
# create column if there is a holiday on any level on this date
date_table["is_any_holiday"] = date_table[
["is_local_holiday", "is_regional_holiday", "is_national_holiday"]
].any(axis=1)
# example
date_table.loc[
date_table.date.between("2013-12-25", "2014-01-01"),
["date", "locale_id"] + [col for col in date_table.columns if any(key in col for key in ("holiday", "type"))]
]
| date | locale_id | local_type | regional_type | national_type | is_local_holiday | is_regional_holiday | is_national_holiday | is_any_holiday | |
|---|---|---|---|---|---|---|---|---|---|
| 663 | 2013-12-25 | Ambato_city | No Holiday | No Holiday | Holiday | False | False | True | True |
| 664 | 2013-12-26 | Ambato_city | No Holiday | No Holiday | Additional | False | False | False | False |
| 665 | 2013-12-27 | Ambato_city | No Holiday | No Holiday | No Holiday | False | False | False | False |
| 666 | 2013-12-28 | Ambato_city | No Holiday | No Holiday | No Holiday | False | False | False | False |
| 667 | 2013-12-29 | Ambato_city | No Holiday | No Holiday | No Holiday | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72950 | 2013-12-28 | Tungurahua_state | No Holiday | No Holiday | No Holiday | False | False | False | False |
| 72951 | 2013-12-29 | Tungurahua_state | No Holiday | No Holiday | No Holiday | False | False | False | False |
| 72952 | 2013-12-30 | Tungurahua_state | No Holiday | No Holiday | No Holiday | False | False | False | False |
| 72953 | 2013-12-31 | Tungurahua_state | No Holiday | No Holiday | Additional | False | False | False | False |
| 72954 | 2014-01-01 | Tungurahua_state | No Holiday | No Holiday | Holiday | False | False | True | True |
280 rows × 9 columns
Since we can already anticipate that Christmas—and possibly Easter (which immediately follows Good Friday) and New Year—will stand out among all holidays in a Christian country, we create special flags for them:
date_table["is_christmas"] = date_table.date.dt.strftime("%m-%d")=="12-25"
viernes_santos = holidays.loc[holidays.description=="Viernes Santo", "date"]
date_table["is_viernes_santo"] = date_table.date.isin(viernes_santos)
date_table["is_new_year"] = date_table.date.dt.strftime("%m-%d")=="01-01"
# example
date_table.loc[
date_table.date.between("2013-12-25", "2014-01-01"),
["date", "locale_id", "is_christmas", "is_viernes_santo", "is_new_year"]]
| date | locale_id | is_christmas | is_viernes_santo | is_new_year | |
|---|---|---|---|---|---|
| 663 | 2013-12-25 | Ambato_city | True | False | False |
| 664 | 2013-12-26 | Ambato_city | False | False | False |
| 665 | 2013-12-27 | Ambato_city | False | False | False |
| 666 | 2013-12-28 | Ambato_city | False | False | False |
| 667 | 2013-12-29 | Ambato_city | False | False | False |
| ... | ... | ... | ... | ... | ... |
| 72950 | 2013-12-28 | Tungurahua_state | False | False | False |
| 72951 | 2013-12-29 | Tungurahua_state | False | False | False |
| 72952 | 2013-12-30 | Tungurahua_state | False | False | False |
| 72953 | 2013-12-31 | Tungurahua_state | False | False | False |
| 72954 | 2014-01-01 | Tungurahua_state | False | False | True |
280 rows × 5 columns
Rather than keeping multiple categorical holiday flags, we encode holidays and events as a signed distance feature:
- `0` on the holiday itself
- `-k` days before (anticipation effect)
- `+k` days after (recovery effect)
This condenses the full dynamic into a single numeric variable. To compute it, I implemented a vectorized NumPy function that calculates the days since the last event and the days until the next event, then combines them into one signed distance. Negative values represent anticipation, positive values recovery. In case of a tie, the function prefers negative values, reflecting the assumption that anticipation usually outweighs recovery.
Below, a toy example with two groups shows the idea, followed by the implementation.
def signed_dist_to_event(df: pd.DataFrame, event_col: str, num_groups: int, prefer_before: bool = True, clip: int | None = None):
"""
Vectorized computation of signed distance to the nearest event per group.
Features:
- Uses NumPy broadcasting to handle all groups in one pass.
- Forward and reversed cumulative scans capture both past and future events.
- Tie-break rule (default: prefer negative values) encodes anticipation > recovery.
Preconditions:
- Data must be sorted by group and date.
- Each group must cover the same contiguous date range.
Returns
-------
np.ndarray
1D array of signed distances, aligned with the input order.
"""
series = df[event_col]
a = series.astype(bool).to_numpy().reshape(num_groups, -1)
group_len = a.shape[1]
idx = np.tile(np.arange(group_len), reps=[num_groups, 1]) # 0 to n-1 in each row
idx_rev = idx[:, ::-1] # reversed idx (n-1 to 0)
prev_idx = np.where(a, idx, -1) # index of actual holiday for each position (-1 else)
prev_idx = np.maximum.accumulate(prev_idx, axis=1) # values before the first event remain -1
days_since = np.where(prev_idx < 0, idx + 1, # treats start like 1 day after an event
idx - prev_idx) # 0 on event, otherwise positive dist from last event
next_idx = np.where(a, idx_rev, -1) # same logic as above, just backwards
next_idx = np.maximum.accumulate(next_idx[:, ::-1], axis=1)[:, ::-1] # values after the last event remain -1
days_to = np.where(next_idx < 0, idx_rev + 1, # treats end like 1 day before an event
idx_rev - next_idx) # 0 on event, otherwise positive dist to next event
if prefer_before:
# use distance from last event when:
# at the end (next_idx<0) or
# not at the start (prev_idx>0) and distance from last event is less than distance to next event
signed = np.where((next_idx < 0) | (days_since < days_to) & (prev_idx > 0),
days_since,
-days_to) # negative values to next event at the start and when days_since==days_to
else:
signed = np.where((prev_idx < 0) | (days_to < days_since) & (next_idx > 0),
-days_to,
days_since)
if clip is not None:
signed = np.clip(signed, -clip, clip)
return signed.astype(int).flatten()
# creating a toy df for exemplification
toy_holiday = pd.DataFrame({
"date": pd.date_range("2021-12-20", periods=10).tolist()*2,
"group": ["A"]*10 + ["B"]*10,
"is_holiday": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
})
toy_holiday["signed_dist_to_holiday"] = signed_dist_to_event(toy_holiday, "is_holiday", num_groups=2)
toy_holiday
| date | group | is_holiday | signed_dist_to_holiday | |
|---|---|---|---|---|
| 0 | 2021-12-20 | A | 0 | -2 |
| 1 | 2021-12-21 | A | 0 | -1 |
| 2 | 2021-12-22 | A | 1 | 0 |
| 3 | 2021-12-23 | A | 0 | 1 |
| 4 | 2021-12-24 | A | 0 | -2 |
| 5 | 2021-12-25 | A | 0 | -1 |
| 6 | 2021-12-26 | A | 1 | 0 |
| 7 | 2021-12-27 | A | 0 | 1 |
| 8 | 2021-12-28 | A | 0 | 2 |
| 9 | 2021-12-29 | A | 0 | 3 |
| 10 | 2021-12-20 | B | 0 | -1 |
| 11 | 2021-12-21 | B | 1 | 0 |
| 12 | 2021-12-22 | B | 0 | 1 |
| 13 | 2021-12-23 | B | 0 | -1 |
| 14 | 2021-12-24 | B | 1 | 0 |
| 15 | 2021-12-25 | B | 0 | 1 |
| 16 | 2021-12-26 | B | 0 | -2 |
| 17 | 2021-12-27 | B | 0 | -1 |
| 18 | 2021-12-28 | B | 1 | 0 |
| 19 | 2021-12-29 | B | 0 | 1 |
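As a guard against subtle broadcasting mistakes, the vectorized output can be cross-checked with a slow but transparent pure-Python reference. This is a hypothetical helper (not part of the original pipeline) that mirrors the same boundary and tie-break rules: the start acts like one day after an event, the end like one day before one, and ties resolve toward the preferred sign.

```python
def signed_dist_reference(flags: list[int], prefer_before: bool = True) -> list[int]:
    """Per-group reference implementation of signed_dist_to_event."""
    n = len(flags)
    events = [i for i, f in enumerate(flags) if f]
    out = []
    for i in range(n):
        prev = max((e for e in events if e <= i), default=-1)   # last event at or before i
        nxt = min((e for e in events if e >= i), default=-1)    # next event at or after i
        days_since = i - prev if prev >= 0 else i + 1           # start like 1 day after an event
        days_to = nxt - i if nxt >= 0 else n - i                # end like 1 day before an event
        if prefer_before:
            out.append(days_since if nxt < 0 or (days_since < days_to and prev > 0) else -days_to)
        else:
            out.append(-days_to if prev < 0 or (days_to < days_since and nxt > 0) else days_since)
    return out
```

On the toy groups above it reproduces the table exactly, e.g. `signed_dist_reference([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])` yields `[-2, -1, 0, 1, -2, -1, 0, 1, 2, 3]`.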
We now apply this function to create raw distance columns in our calendar for every boolean flag column. During data exploration, we will test whether the raw distances, a clipped version, or a transformed version are more beneficial, depending on the model at hand. Tree-based models are unaffected by clipping, while neural networks may benefit from transformations such as the Gaussian-odd.
import re
# create distances
flags = [col for col in date_table if col.startswith("is")]
pattern = re.compile(r"is_(\w*)")
date_table_groups = date_table.date.value_counts().iloc[0] # same number of occurrences due to table construction method
for flag in flags:
new_col_name = f"dist_{re.search(pattern, flag).group(1)}"
date_table[new_col_name] = signed_dist_to_event(date_table, event_col=flag, num_groups=date_table_groups)
# remove binary flags
#date_table = date_table.drop([col for col in date_table.columns if col.startswith("is")], axis=1)
date_table[["date", "locale_id"] + [col for col in date_table.columns if col.startswith("dist")]] # example
| date | locale_id | dist_black_friday | dist_cyber_monday | dist_dia_de_la_madre | dist_mundial | dist_terremoto_manabi | dist_mundial_ecuador | dist_workday | dist_local_holiday | dist_regional_holiday | dist_national_holiday | dist_any_holiday | dist_christmas | dist_viernes_santo | dist_new_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Ambato_city | -1001 | -1004 | -436 | -832 | -1506 | -835 | -309 | -175 | 1 | -161 | -161 | -298 | -423 | -305 |
| 1 | 2012-03-03 | Ambato_city | -1000 | -1003 | -435 | -831 | -1505 | -834 | -308 | -174 | 2 | -160 | -160 | -297 | -422 | -304 |
| 2 | 2012-03-04 | Ambato_city | -999 | -1002 | -434 | -830 | -1504 | -833 | -307 | -173 | 3 | -159 | -159 | -296 | -421 | -303 |
| 3 | 2012-03-05 | Ambato_city | -998 | -1001 | -433 | -829 | -1503 | -832 | -306 | -172 | 4 | -158 | -158 | -295 | -420 | -302 |
| 4 | 2012-03-06 | Ambato_city | -997 | -1000 | -432 | -828 | -1502 | -831 | -305 | -171 | 5 | -157 | -157 | -294 | -419 | -301 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 74405 | 2017-12-22 | Tungurahua_state | 392 | 389 | 222 | 1258 | 615 | 1276 | 405 | 2122 | 2122 | -3 | -3 | -3 | 252 | 355 |
| 74406 | 2017-12-23 | Tungurahua_state | 393 | 390 | 223 | 1259 | 616 | 1277 | 406 | 2123 | 2123 | -2 | -2 | -2 | 253 | 356 |
| 74407 | 2017-12-24 | Tungurahua_state | 394 | 391 | 224 | 1260 | 617 | 1278 | 407 | 2124 | 2124 | -1 | -1 | -1 | 254 | 357 |
| 74408 | 2017-12-25 | Tungurahua_state | 395 | 392 | 225 | 1261 | 618 | 1279 | 408 | 2125 | 2125 | 0 | 0 | 0 | 255 | 358 |
| 74409 | 2017-12-26 | Tungurahua_state | 396 | 393 | 226 | 1262 | 619 | 1280 | 409 | 2126 | 2126 | 1 | 1 | 1 | 256 | 359 |
74410 rows × 16 columns
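The clipped and transformed distance variants mentioned earlier can be sketched as follows. The clip threshold and scale here are illustrative choices, and `tanh` merely stands in for whatever bounded, sign-preserving transform gets tested later:

```python
import numpy as np

raw = np.array([-1001, -30, -7, 0, 7, 30, 1001])   # raw signed distances (toy values)
clipped = np.clip(raw, -14, 14)                    # hard clip: everything beyond 2 weeks looks the same
smooth = np.tanh(raw / 7.0)                        # bounded, sign-preserving alternative for neural nets
```

Both variants preserve the sign (anticipation vs. recovery) while taming the extreme magnitudes that raw distances accumulate far away from rare events.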
Since we always know the weekday, calendar day, month, day of year, and year in advance, we can add this information as separate columns to our calendars in date_table as well. We also compute a column that counts the number of days elapsed since the first day of date_table (the same across all locale IDs), as this information is also known in advance and the choice of starting point does not matter.
import calendar
# fetch weekday and month names
weekdays = list(calendar.day_name)
months = list(calendar.month_name)[1:] # first element is an empty string
# extract some general information contained in the date column
date_table["weekday"] = pd.Categorical(date_table.date.dt.day_name(),
categories=weekdays,
ordered=True)
date_table["day"] = date_table.date.dt.day
date_table["month"] = pd.Categorical(date_table.date.dt.month_name(),
categories=months,
ordered=True)
date_table["day_of_year"] = date_table.date.dt.day_of_year
date_table["year"] = date_table.date.dt.year
date_table["is_leap_year"] = (
    (date_table["year"] % 4 == 0) & (date_table["year"] % 100 != 0) # divisible by 4 -> leap, but not if divisible by 100
) | (date_table["year"] % 400 == 0) # all years divisible by 400 are leap years
date_table["days_elapsed"] = ((date_table.date - pd.to_datetime(date_table.date.min()))
/np.timedelta64(1, "D")
).astype(int)
date_table[["date", "locale_id", "weekday", "day", "month", "year", "is_leap_year", "days_elapsed"]]
| date | locale_id | weekday | day | month | year | is_leap_year | days_elapsed | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Ambato_city | Friday | 2 | March | 2012 | True | 0 |
| 1 | 2012-03-03 | Ambato_city | Saturday | 3 | March | 2012 | True | 1 |
| 2 | 2012-03-04 | Ambato_city | Sunday | 4 | March | 2012 | True | 2 |
| 3 | 2012-03-05 | Ambato_city | Monday | 5 | March | 2012 | True | 3 |
| 4 | 2012-03-06 | Ambato_city | Tuesday | 6 | March | 2012 | True | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 74405 | 2017-12-22 | Tungurahua_state | Friday | 22 | December | 2017 | False | 2121 |
| 74406 | 2017-12-23 | Tungurahua_state | Saturday | 23 | December | 2017 | False | 2122 |
| 74407 | 2017-12-24 | Tungurahua_state | Sunday | 24 | December | 2017 | False | 2123 |
| 74408 | 2017-12-25 | Tungurahua_state | Monday | 25 | December | 2017 | False | 2124 |
| 74409 | 2017-12-26 | Tungurahua_state | Tuesday | 26 | December | 2017 | False | 2125 |
74410 rows × 8 columns
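Some of these calendar features are periodic, and gradient-based models (unlike trees) can struggle with the jump from day 365 back to day 1. A common remedy, shown here with a hypothetical `cyclical_encode` helper rather than a step of the actual pipeline, is a sine/cosine encoding:

```python
import numpy as np
import pandas as pd

def cyclical_encode(values: pd.Series, period: float) -> pd.DataFrame:
    """Map a periodic feature onto the unit circle so its endpoints meet."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({f"{values.name}_sin": np.sin(angle),
                         f"{values.name}_cos": np.cos(angle)})

doy = pd.Series([1, 365], name="day_of_year")        # 1 January vs. 31 December
enc = cyclical_encode(doy, period=365.25)
gap = float(np.hypot(*(enc.iloc[1] - enc.iloc[0])))  # small: the year's ends sit next to each other
```

The two encoded points end up nearly identical although the raw values differ by 364, which is exactly the continuity a neural network benefits from.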
All we have to do now is create a locale_id column in sales__after_stores that follows the same logic as its counterpart in date_table and merge the DataFrames, yielding a large new DataFrame, favoritas.
# create column for merge
sales__after_stores["locale_id"] = np.where(
sales__after_stores["city"].isin(holiday_cities),
sales__after_stores["city"] + "_city",
sales__after_stores["state"] + "_state"
)
favoritas = sales__after_stores.merge(
date_table[[col for col in date_table.columns if col not in ("city", "state")]], # city & state already in sales__...
on=["date", "locale_id"],
how="left"
)
favoritas
| date | store_nbr | family | id | sales | onpromotion | series_id | city | state | type | ... | dist_christmas | dist_viernes_santo | dist_new_year | weekday | day | month | day_of_year | year | is_leap_year | days_elapsed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | ... | 7 | -118 | 0 | Tuesday | 1 | January | 1 | 2013 | False | 305 |
| 1 | 2013-01-02 | 1 | AUTOMOTIVE | 1782.0 | 2.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | ... | 8 | -117 | 1 | Wednesday | 2 | January | 2 | 2013 | False | 306 |
| 2 | 2013-01-03 | 1 | AUTOMOTIVE | 3564.0 | 3.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | ... | 9 | -116 | 2 | Thursday | 3 | January | 3 | 2013 | False | 307 |
| 3 | 2013-01-04 | 1 | AUTOMOTIVE | 5346.0 | 3.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | ... | 10 | -115 | 3 | Friday | 4 | January | 4 | 2013 | False | 308 |
| 4 | 2013-01-05 | 1 | AUTOMOTIVE | 7128.0 | 5.000000 | 0.0 | 1_AUTOMOTIVE | Quito | Pichincha | D | ... | 11 | -114 | 4 | Saturday | 5 | January | 5 | 2013 | False | 309 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3008011 | 2017-08-11 | 9 | SEAFOOD | 2993759.0 | 23.831000 | 0.0 | 9_SEAFOOD | Quito | Pichincha | B | ... | -136 | 119 | 222 | Friday | 11 | August | 223 | 2017 | False | 1988 |
| 3008012 | 2017-08-12 | 9 | SEAFOOD | 2995541.0 | 16.859001 | 4.0 | 9_SEAFOOD | Quito | Pichincha | B | ... | -135 | 120 | 223 | Saturday | 12 | August | 224 | 2017 | False | 1989 |
| 3008013 | 2017-08-13 | 9 | SEAFOOD | 2997323.0 | 20.000000 | 0.0 | 9_SEAFOOD | Quito | Pichincha | B | ... | -134 | 121 | 224 | Sunday | 13 | August | 225 | 2017 | False | 1990 |
| 3008014 | 2017-08-14 | 9 | SEAFOOD | 2999105.0 | 17.000000 | 0.0 | 9_SEAFOOD | Quito | Pichincha | B | ... | -133 | 122 | 225 | Monday | 14 | August | 226 | 2017 | False | 1991 |
| 3008015 | 2017-08-15 | 9 | SEAFOOD | 3000887.0 | 16.000000 | 0.0 | 9_SEAFOOD | Quito | Pichincha | B | ... | -132 | 123 | 226 | Tuesday | 15 | August | 227 | 2017 | False | 1992 |
3008016 rows × 57 columns
The number of rows is still the same as in sales__after_stores, so no duplicate dates were introduced.
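This invariant can be enforced rather than eyeballed: pandas' `merge` accepts `validate` (raising a `MergeError` on unexpected key duplication) and `indicator=True` (exposing rows without a calendar match). A toy sketch of the pattern, with hypothetical data but the same join keys as the real merge:

```python
import pandas as pd

sales_toy = pd.DataFrame({"date": pd.to_datetime(["2017-08-01", "2017-08-02"]),
                          "locale_id": ["Quito_city", "Quito_city"],
                          "sales": [10.0, 12.0]})
calendar_toy = pd.DataFrame({"date": pd.to_datetime(["2017-08-01", "2017-08-02"]),
                             "locale_id": ["Quito_city", "Quito_city"],
                             "is_any_holiday": [False, True]})
merged = sales_toy.merge(calendar_toy, on=["date", "locale_id"], how="left",
                         validate="many_to_one",  # calendar keys must be unique
                         indicator=True)          # adds a _merge column
```

With `validate="many_to_one"`, an accidental duplicate date–locale pair in the calendar would fail loudly at merge time instead of silently inflating the row count.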
Although provided as a data table, the oil price would not be known in advance in a real production setting. Including it would require uncertain forecasts, since oil prices are highly sensitive to unexpected events, and this could add instability to the models. Therefore, I decided not to include it, focusing instead on holidays and promotions, which are realistic forward-known features.
With this, we now have all columns needed for feature engineering in a single DataFrame.
Note that all the steps we performed so far—such as assigning zero sales to public holidays like 25 December or resolving overlaps in holiday types—should ideally be handled directly at the database level. These operations can be implemented in SQL scripts, so that the Python layer focuses on data analysis and machine learning pipelines. While pandas can efficiently perform joins, we deliberately avoid such operations within an sklearn pipeline to keep modeling logic isolated from raw data preparation, ensuring robustness and maintainability.
Therefore, our ML pipeline will consist only of modeling-related steps from here on.
But first we split the data into a train and test set to avoid data leakage from the test set into the training set. Our task is to predict the sales 14 days in the future. So it makes sense to set the test dataset as the last 14 days of the regarded time window:
train_test_date = favoritas.date.max() + pd.DateOffset(-14 + 1) # + 1 as the max date is already included
train_test_date
train = favoritas[favoritas.date < train_test_date].copy()
test = favoritas[favoritas.date >= train_test_date].copy()
# Print the resulting dates
print("Train Dates:\n", train.date.unique(), "\n\nTest Dates:\n", test.date.unique())
Train Dates: <DatetimeArray> ['2013-01-01 00:00:00', '2013-01-02 00:00:00', '2013-01-03 00:00:00', '2013-01-04 00:00:00', '2013-01-05 00:00:00', '2013-01-06 00:00:00', '2013-01-07 00:00:00', '2013-01-08 00:00:00', '2013-01-09 00:00:00', '2013-01-10 00:00:00', ... '2017-07-23 00:00:00', '2017-07-24 00:00:00', '2017-07-25 00:00:00', '2017-07-26 00:00:00', '2017-07-27 00:00:00', '2017-07-28 00:00:00', '2017-07-29 00:00:00', '2017-07-30 00:00:00', '2017-07-31 00:00:00', '2017-08-01 00:00:00'] Length: 1674, dtype: datetime64[ns] Test Dates: <DatetimeArray> ['2017-08-02 00:00:00', '2017-08-03 00:00:00', '2017-08-04 00:00:00', '2017-08-05 00:00:00', '2017-08-06 00:00:00', '2017-08-07 00:00:00', '2017-08-08 00:00:00', '2017-08-09 00:00:00', '2017-08-10 00:00:00', '2017-08-11 00:00:00', '2017-08-12 00:00:00', '2017-08-13 00:00:00', '2017-08-14 00:00:00', '2017-08-15 00:00:00'] Length: 14, dtype: datetime64[ns]
print("train shape:", train.shape, "\ntest shape:", test.shape)
train shape: (2983068, 57) test shape: (24948, 57)
We see that our training data now has a bit less than 3 million rows (which is still plenty of data for the models to learn from!), while the remaining rows, nearly 25k, form the test data.
With over 1,700 different time series (one for every combination of store_nbr and family), it is impossible to make appealing visualizations of the temporal structure, such as trend and seasonalities. Therefore, our first goal is to create features that allow us to select a reasonable sample. But first, we want to check how our target is distributed:
import matplotlib.pyplot as plt
import seaborn as sns
# for visually more appealing plotting
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set_style("whitegrid")
plt.figure(figsize=(16, 4))
sns.histplot(train, x="sales", binwidth=100);
plt.title("Raw Distribution of Sales");
This distribution is extremely right-skewed: the vast majority (around two thirds of the nearly 3 million rows in the training data) lies below 100 (note the scale of the counts on the y-axis). However, some very large values exist, which are not even visible due to the scale compression on the x-axis.
To reveal the distribution of rare but large sales values, we apply a logarithmic scale to the x-axis. This helps us visualize structure across orders of magnitude, which is not visible in the raw histogram:
plt.figure(figsize=(16, 4))#, dpi=200)
sns.histplot(train, x="sales", bins=100, log_scale=10);
plt.title("Log-scaled Distribution of Sales");
This version is a bit harder to interpret at first. Why does it look so different?
Each bin spans an equal width in log₁₀(sales) space (e.g., 0.05 log-units), meaning that the actual sales ranges covered by each bin increase multiplicatively — for example:
- The first few bins might cover 1–1.12, 1.12–1.26, 1.26–1.41, ...
- later bins might cover 1000–1122 or 10,000–11,220.
This uneven spacing results in visible gaps, especially in the lower range where values like 1.12–1.26 may occur infrequently or not at all.
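The multiplicative bin widths can be computed directly. With 100 equal bins between $10^0$ and $10^5$, each bin edge is a constant factor $10^{0.05} \approx 1.122$ above the previous one:

```python
import numpy as np

edges = np.logspace(0, 5, num=101)   # 100 equal-width bins in log10 space, from 1 to 100,000
ratios = edges[1:] / edges[:-1]      # constant multiplicative step between consecutive edges
widths = np.diff(edges)              # absolute widths grow toward the right of the axis
```

This reproduces the numbers quoted above: the first bin spans roughly 1–1.122, while a bin starting at 10,000 spans roughly 10,000–11,220.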
What can we learn? At first glance, the right part of the plot resembles a log-normal distribution with a peak between approximately $10^{2.3}$ and $10^{2.4}$ followed by a long, slowly decreasing tail. This suggests that large values are not just outliers, but part of the dataset's inherent structure. However, very high values close to $10^5$ are so rare that they cannot be visibly printed in this log-scaled version either.
Nevertheless, unlike a true log-normal distribution, we observe a massive overrepresentation of very low values. The mode of the non-zero distribution appears to be 1 ($10^0$), but in fact:
The most frequent value is zero.
Let us confirm this by examining the top 10 most frequent sales values:
train.sales.value_counts().sort_values(ascending=False).iloc[:10]
sales 0.0 942612 1.0 114295 2.0 85161 3.0 67907 4.0 57321 5.0 49359 6.0 43080 7.0 37802 8.0 33184 9.0 30042 Name: count, dtype: int64
Nearly one-third of the dataset consists of zeros. These are not shown in the log-scaled histogram because $\log(0)$ is undefined and excluded automatically by Seaborn/Matplotlib.
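A common remedy for zero-inflated positive targets (not necessarily the transform chosen later in this project) is `np.log1p`, which is defined at zero and exactly invertible via `np.expm1`:

```python
import numpy as np

sales_toy = np.array([0.0, 1.0, 99.0, 9999.0])  # toy values standing in for the sales column
log_sales = np.log1p(sales_toy)                 # log(1 + x): maps 0 -> 0 instead of -inf
restored = np.expm1(log_sales)                  # exact inverse transform
```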
While the above value counts suggest that sales is integer-valued, this is not strictly the case. Non-integer values do exist, typically for products sold by weight (e.g., MEATS), but they are far less frequent.
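This can be quantified with a one-liner. On a toy series (the same expression applies unchanged to `train["sales"]`):

```python
import pandas as pd

sales_toy = pd.Series([0.0, 1.0, 2.0, 3.5, 0.25, 7.0])  # toy values standing in for train["sales"]
non_integer_share = float((sales_toy % 1 != 0).mean())   # fraction of fractional sales values
```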
Since each series represents a unique combination of one product family in one store, we take a brief look at the distribution of the mean and standard deviation of these columns (due to a lot of outliers the usually used boxplots are not helpful in this case). We begin with the stores, which we will convert to the appropriate categorical data type as we go along:
import textwrap
def make_barplot(data: pd.DataFrame, x: str, y: str="sales", line_wrap: int=10, wrap: bool=True, labelrotation: int=45):
df = data.groupby(x, observed=False)[y].agg(["mean", "std"]).reset_index()
if wrap:
df[x] = df[x].astype(str).apply(lambda x: "\n".join(textwrap.wrap(x, line_wrap)))
x_label = x.removesuffix("_wrapped").replace("_", " ").title()
else:
x_label = x.replace("_", " ").title()
yerr = [np.zeros(len(df)), df["std"].values]
plt.figure(figsize=(18, 5))
sns.barplot(df, x=x, y="mean")
plt.errorbar(x, y="mean", fmt="none", yerr=yerr, capsize=0, capthick=0, lw=1, data=df)
plt.xlabel(x_label); plt.ylabel(y.title())
plt.tick_params(axis="x", labelrotation=labelrotation)
plt.title(f"Means and Standard Deviations of Sales by {x_label}")
# convert store_nbr
stores_cat = pd.CategoricalDtype(np.sort(stores.astype(int)), ordered=True)
train["store_nbr"] = train["store_nbr"].astype(stores_cat)
make_barplot(train, "store_nbr")
There are large differences in the number of articles sold between stores, both in the mean and the standard deviation—stores having higher means also show higher standard deviations. Most likely, stores vary greatly in size. To gain a better understanding, we inspect the store metadata columns from the store table. The type and cluster columns came without a sensible explanation of what kinds of stores they represent. Perhaps we can better grasp these columns by examining which stores they include. We start with the store type.
# convert type
types_cat = pd.CategoricalDtype(np.sort(train["type"].unique()), ordered=True)
train["type"] = train["type"].astype(types_cat)
def make_grouped_stores_plot(df: pd.DataFrame, col: str, title_grouper: str):
g = sns.catplot(train, x="store_nbr", y="sales", col=col, col_wrap=2, estimator="mean", errorbar=None,
height=2, aspect=3.5, kind="bar");
for ax in g.axes.flat:
ax.tick_params("x", labelsize=8, labelrotation=90)
ax.set_xlabel("Store Nbr"); ax.set_ylabel("Sales")
y_ticks_labels = np.arange(0, 1250, 250)
ax.set_yticks(y_ticks_labels, y_ticks_labels)
g.set_titles(f"{col.title()} " + "{col_name}")
plt.suptitle(f"Mean Sales of Stores grouped by {title_grouper}", y=1.04);
make_grouped_stores_plot(train, "type", "Store Type")
The mean sales alone cannot explain the store type, but we observe:
- Type A appears to represent a few very large stores. Store Number 52 is also such a large store, opened in 2017 near the end of the training window (all product families show zero sales before that).
- Type B consists of medium-sized stores
- Type C seems to be a common type, representing the smallest stores
- Type D is the most frequent type, with varying store sizes
- Type E includes only 4 relatively small stores
While this is not exhaustive information, there does seem to be a connection to store size. What do the clusters represent?
xtab = pd.crosstab(index=train["cluster"], columns=train["store_nbr"], values=train["sales"], aggfunc="mean")
plt.figure(figsize=(18, 5))
sns.heatmap(xtab, cmap="viridis", square=True)
plt.xlabel("Store Nbr"), plt.ylabel("Cluster")
plt.title("Sales Means by Store Cluster and Store Number");
With 17 different clusters, faceted barplots would be too chaotic. This heatmap shows that many store clusters consist of only very few stores, making it difficult to find meaningful patterns. Nevertheless, the clusters are clearly distinct from the store types.
The store metadata also includes the city and state, but this information is largely redundant: in most states there are stores in only a single city. Exceptions include Guayas with four cities, and Los Rios, Manabi, and Pichincha with two cities each. The city with the most stores is the capital Quito with 18 (one-third of all) stores, followed by Guayaquil with 8 stores. All other cities have up to three stores, usually only one.
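The redundancy claim can be checked with two short groupbys, sketched here on a hypothetical excerpt of the stores metadata (the real table lists all 54 stores):

```python
import pandas as pd

# hypothetical excerpt of the stores metadata table
stores_demo = pd.DataFrame({
    "store_nbr": [1, 2, 3, 4, 5],
    "city": ["Quito", "Quito", "Guayaquil", "Daule", "Ambato"],
    "state": ["Pichincha", "Pichincha", "Guayas", "Guayas", "Tungurahua"],
})

# distinct cities per state: mostly 1, with a few exceptions such as Guayas
print(stores_demo.groupby("state")["city"].nunique())

# store counts per city reveal Quito's dominance
print(stores_demo["city"].value_counts())
```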
While it remains uncertain whether this store metadata will be useful for modeling, we now turn to the second component that defines unique time series: the product family.
make_barplot(train, "family", labelrotation=90, wrap=False)
The differences between product families are striking. Some of the means are too small to even register in the plot, while a few product families such as Grocery I, Beverages, Produce and Cleaning dominate sales almost entirely.
This uneven distribution might also reflect very different scales across time series. To further analyze this, we now compute the mean and standard deviation for each individual time series and visualize their variation.
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
# raw plot
means_stds = train[["sales", "series_id"]].groupby("series_id").agg(["mean", "std"])
means_stds.columns = ["Series Mean", "Series Std"]
#plt.figure(figsize=(6, 4))#, dpi=200)
sns.scatterplot(means_stds, x="Series Mean", y="Series Std", alpha=0.7, ax=axs[0]);
axs[0].set_xlim(-500, 10000);
# log plot
means_stds_log = (train.assign(log_sales=np.log1p(train.sales))[["log_sales", "series_id"]]
.groupby("series_id")
.agg(["mean", "std"]))
means_stds_log.columns = ["mean", "std"]
#plt.figure(figsize=(8, 4))#, dpi=200)
sns.scatterplot(means_stds_log, x="mean", y="std", alpha=0.7, ax=axs[1]);
axs[1].set_xlabel("Series Mean (Log)")
axs[1].set_ylabel("Series Std (Log)")
axs[1].set_xlim(-0.5, 10);
The left scatterplot shows the mean and standard deviation of each series, computed on the raw sales values. As expected, there is a clear trend: the higher the mean, the higher the standard deviation.
However, what really stands out is how vast the differences in scale are:
- Some series have means around 0, others go beyond 8000.
- A few very large-scale series dominate the plot, while the bulk of series gets compressed into the lower-left corner.
So while this gives a rough idea of scale differences, most of the smaller series are hard to interpret here. Therefore, I applied np.log1p(sales) (to account for the many zeros, we add one before applying the natural logarithm) before calculating the mean and standard deviation of each series to create the right scatterplot.
The right plot changes the view quite a bit:
- The scale is compressed, which helps unfold the structure among smaller and medium-sized series.
- We can now spot distinct bands or patterns (especially in the lower range), likely caused by sales rounding or different product types.
- Some series still show high variability even with a low log-mean — possibly due to bursts or seasonal peaks.
- This log-transformed version does not just make outliers smaller — it gives us a much better sense of the rich structure and diversity of series behavior hidden in the data.
Both plots confirm the basic trend:
- Mean and standard deviation are positively correlated, and there are huge differences in scale. This means that models will benefit from groupwise scaling, as they do not have to learn the different scales.
- The log-transformed version makes it much easier for models to fit the typical series, not just the extreme ones.
- This shows that models are likely to benefit not only from groupwise scaling, but also from applying `np.log1p` to sales beforehand: it reduces the skew of both the mean and standard deviation within one series. Without log transformation, a few extreme values can inflate the mean and/or the standard deviation, leading to distorted z-scores — especially for values that are actually typical for that series. Log-transforming first makes the distribution more symmetric, so that the z-scores more accurately reflect relative variation.
This means we need to apply two transformations to the target:
- A logarithmic transformation (`log1p`) to reduce skew and compress large values.
- Group-wise standardization to normalize the scale of each individual series.
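To make the two-step transformation concrete, here is a minimal numeric sketch (toy values: two series of four days each, stacked contiguously as in the training frame):

```python
import numpy as np

sales = np.array([0.0, 2.0, 3.0, 5.0,           # series A (small scale)
                  100.0, 300.0, 250.0, 900.0])  # series B (large scale)

# step 1: log1p compresses large values and handles the zeros gracefully
logged = np.log1p(sales)

# step 2: group-wise standardization, one row per series
grouped = logged.reshape(2, -1)
means = grouped.mean(axis=1, keepdims=True)
stds = grouped.std(axis=1, keepdims=True)
z = (grouped - means) / stds   # both series now live on a comparable scale

# inverting both steps recovers the original values exactly
recovered = np.expm1(z * stds + means)
print(np.allclose(recovered.ravel(), sales))  # True
```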
Why not use RobustScaler or QuantileTransformer? (Click here for more Info)
Alternatives like QuantileTransformer or RobustScaler were considered as well.
However, they come with trade-offs:
- With some extreme outliers, `RobustScaler` would still return very large values, due to small IQRs.
- Both are harder to invert, which complicates interpreting model outputs.
- They may not suit models like LSTMs or TFTs, where smooth, monotonic scaling is important.
By contrast, applying log1p followed by group-wise standardization:
- Preserves interpretability
- Reduces skew
- Keeps everything pipeline-compatible
- And makes z-scores more meaningful within each series
from sklearn.preprocessing import FunctionTransformer
def log1p_feature_names(transformer, input_features):
return [f"{f}_log1p" for f in input_features]
log1p_transformer = FunctionTransformer(
np.log1p,
inverse_func=np.expm1,
check_inverse=False, # only values >= 0, inverse_func is perfect inverse of func
feature_names_out=log1p_feature_names
)
The second step, however – standardizing the sales values for each series individually – is not directly supported by scikit-learn out of the box. That is why I implemented a custom transformer: it assumes that the data is sorted by group, so no group labels are needed. It supports pandas Series, 1-column DataFrames, and NumPy arrays, and follows the usual fit, transform, and inverse_transform API.
Internally, it uses NumPy broadcasting for full vectorization — making it efficient to run while keeping the implementation clean and compatible with both Pipeline and TransformedTargetRegressor.
Future extension: handling new or reordered series
If future use cases require new series at inference or arbitrary subsets/reordering, extend the scaler as follows:
- Store per-series statistics in a dictionary keyed by `series_id` at `fit` time (training window only).
- At inference, join predictions with their `series_id`, and use those IDs to look up the stored parameters for inverse scaling. For feature transforms, include `series_id` in `X` so `transform` can look up the correct parameters for any subset or order.
- For truly new series, compute and cache statistics from the observed pre-forecast window only (no padding), then reuse them for inverse transformation.
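A minimal sketch of this lookup-based variant (the helper names `fit_stats` and `inverse_by_id` are illustrative, not part of the notebook's code):

```python
import numpy as np
import pandas as pd

def fit_stats(y: pd.Series, series_ids: pd.Series) -> dict:
    """Store per-series mean/std keyed by series_id (training window only)."""
    stats = {}
    for sid, grp in y.groupby(series_ids.values):
        std = grp.std(ddof=0)
        stats[sid] = (grp.mean(), std if std > 0 else 1e-8)
    return stats

def inverse_by_id(values: np.ndarray, series_ids: pd.Series, stats: dict) -> np.ndarray:
    """Invert scaling for any subset or order by looking up each row's series."""
    means = np.array([stats[s][0] for s in series_ids])
    stds = np.array([stats[s][1] for s in series_ids])
    return values * stds + means

ids = pd.Series(["A", "A", "B", "B"])
y = pd.Series([1.0, 3.0, 10.0, 30.0])
stats = fit_stats(y, ids)

# scale, then invert an arbitrarily reordered subset
scaled = (y - ids.map(lambda s: stats[s][0])) / ids.map(lambda s: stats[s][1])
perm = [2, 0, 3, 1]
restored = inverse_by_id(scaled.to_numpy()[perm], ids.iloc[perm], stats)
print(np.allclose(restored, y.to_numpy()[perm]))  # round-trip succeeds
```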
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
from typing import Union
class GroupStandardScaler(BaseEstimator, TransformerMixin):
"""
Group-wise standard scaler for 1D numerical data.
This transformer standardizes values separately for each group (e.g., time series), assuming that all groups
are of equal size and the input is sorted such that all values of a group are contiguous.
Supports input as a pandas Series, a single-column DataFrame, or a 1D/2D NumPy array.
"""
def __init__(self, num_groups: int, inverse_sorted_by_group: bool, log_assumed_structure: bool=False):
"""num_groups: Number of groups in the dataset."""
self.num_groups = num_groups # the number of series is crucial to transform without using the group ids
self.inverse_sorted_by_group = inverse_sorted_by_group # hint for inverse_transform, how to reshape the data
self.log_assumed_structure = log_assumed_structure # log assumed data structure in inverse_transform?
def to_1d_array(self, X: Union[np.ndarray, pd.Series, pd.DataFrame]):
"""Converts input to a flat 1D NumPy array of shape (n,).
Accepts pandas Series, DataFrames with a single column, or numpy arrays of shape (n,) or (n, 1).
"""
if isinstance(X, pd.DataFrame):
if X.shape[1] != 1:
raise ValueError("Expected DataFrame with one column")
return X.iloc[:, 0].to_numpy()
if isinstance(X, pd.Series):
return X.to_numpy()
if isinstance(X, np.ndarray):
if X.ndim == 2 and X.shape[1] == 1:
return X.ravel() # reduce the array to a 1D array
elif X.ndim == 1:
return X
else:
raise ValueError("Expected either a 1D array or a 2D array with shape (n, 1)")
raise TypeError("Input must be a Series, DataFrame with one column, or 1D/2D array")
def check_group_size(self, X: np.ndarray):
"""Checks whether the number of samples is divisible by the number of groups."""
if X.shape[0] % self.num_groups != 0:
raise ValueError("Input Array is not a multiple of the number of groups. Are all groups of equal size?")
def fit(self, X: np.ndarray | pd.Series | pd.DataFrame, y=None):
"""
Computes group-specific means and standard deviations for standardization.
Assumes that the data is sorted such that each group’s values are contiguous.
"""
if isinstance(X, pd.DataFrame):
self.feature_names_in_ = np.array(X.columns, dtype=object)
X = self.to_1d_array(X)
self.check_group_size(X)
X_reshaped = X.reshape(self.num_groups, -1) # each row represents one series, and the columns the days
self.means_ = X_reshaped.mean(axis=1).reshape(-1, 1) # reshape in (num_groups, 1) for broadcasting in transform
stds = X_reshaped.std(axis=1).reshape(-1, 1) # may contain zeros (if a group's values are all equal)
self.stds_ = np.where(stds == 0, 1e-8, stds) # avoid division by zero in transformation
self.n_features_in_ = 1 # must be 1, otherwise to_1d_array raises an error
return self
def transform(self, X: Union[np.ndarray, pd.Series, pd.DataFrame]) -> np.ndarray:
"""
Applies group-wise standardization to the input.
Assumes that the input is sorted such that all values of each group appear contiguously.
"""
check_is_fitted(self) # are there fitted attributes with trailing underscores?
X = self.to_1d_array(X)
self.check_group_size(X)
X_reshaped = X.reshape(self.num_groups, -1)
X_stand = (X_reshaped - self.means_) / self.stds_ # broadcasting
return X_stand.reshape(-1, 1)
def inverse_transform(self, X: Union[np.ndarray, pd.Series, pd.DataFrame]) -> np.ndarray:
"""
Reverses the transformation and transforms values back to the original scale.
If self.inverse_sorted_by_group = True, assumes group blocks are contiguous (as in fit/transform). If False,
assumes values are interleaved — i.e., all group-1 values occur at positions 0, N, 2N, etc.
Returns a 2D numpy array of shape (n, 1).
"""
X = self.to_1d_array(X)
self.check_group_size(X)
# bring the data in the correct shape for broadcasting
log_message_stem = " for GroupStandardScaler's inverse_transform, does this align with your input?"
if self.inverse_sorted_by_group:
X_reshaped = X.reshape(self.num_groups, -1)
if self.log_assumed_structure:
print("Info: Assumed contiguous group blocks" + log_message_stem)
else:
series_len = X.shape[0]//self.num_groups
X_reshaped = X.reshape(series_len, -1).T
if self.log_assumed_structure:
print("Info: Assumed interleaved values" + log_message_stem)
X_inversed = X_reshaped * self.stds_ + self.means_ # broadcasting
return X_inversed.reshape(-1, 1)
def get_feature_names_out(self, input_features=None):
if input_features is None:
if hasattr(self, "feature_names_in_"):
input_features = self.feature_names_in_
else:
input_features = np.array([""], dtype=object)
return np.array([f"{name}_groupscaled" for name in input_features], dtype=object)
Since this is a solid piece of code, it makes sense to demonstrate its functionality directly. We also add a sales_scaled column to the train DataFrame, as this standardized version of sales will be helpful during exploration.
The example below shows how the scaler works end-to-end, including the inverse transformation to return to the original sales scale.
group_scaler = GroupStandardScaler(num_groups=num_series, inverse_sorted_by_group=True)
# create the sales_scaled column for exploration
train["sales_scaled"] = group_scaler.fit_transform(train["sales"])
# create an example frame for demonstration only
scaler_demo = train[["series_id", "sales", "sales_scaled"]].copy()
scaler_demo[["sales_scaled_inverted"]] = group_scaler.inverse_transform(scaler_demo["sales_scaled"])
scaler_demo
| series_id | sales | sales_scaled | sales_scaled_inverted | |
|---|---|---|---|---|
| 0 | 1_AUTOMOTIVE | 0.000 | -1.171285 | 0.000 |
| 1 | 1_AUTOMOTIVE | 2.000 | -0.446297 | 2.000 |
| 2 | 1_AUTOMOTIVE | 3.000 | -0.083802 | 3.000 |
| 3 | 1_AUTOMOTIVE | 3.000 | -0.083802 | 3.000 |
| 4 | 1_AUTOMOTIVE | 5.000 | 0.641186 | 5.000 |
| ... | ... | ... | ... | ... |
| 3007997 | 9_SEAFOOD | 11.000 | -0.617762 | 11.000 |
| 3007998 | 9_SEAFOOD | 21.916 | 0.539653 | 21.916 |
| 3007999 | 9_SEAFOOD | 19.909 | 0.326852 | 19.909 |
| 3008000 | 9_SEAFOOD | 12.000 | -0.511733 | 12.000 |
| 3008001 | 9_SEAFOOD | 19.316 | 0.263977 | 19.316 |
2983068 rows × 4 columns
We can see that the scaler works as expected and that sales_scaled_inverted exactly matches the original values. Now that we have both transformers, we can build the pipeline that combines both steps:
from sklearn.pipeline import make_pipeline
target_pipe = make_pipeline(log1p_transformer,
GroupStandardScaler(num_groups=num_series, inverse_sorted_by_group=True))
target_pipe
Pipeline(steps=[('functiontransformer',
FunctionTransformer(check_inverse=False,
feature_names_out=<function log1p_feature_names at 0x36ca09120>,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>)),
('groupstandardscaler',
GroupStandardScaler(inverse_sorted_by_group=True,
num_groups=1782))])
Since the correct target transformation—and its reversal—are crucial, we also check the behavior of target_pipe:
# create an example frame for the whole target pipeline
target_pipe_demo = scaler_demo.copy()
target_pipe_demo["sales_transformed"] = target_pipe.fit_transform(target_pipe_demo.sales)
target_pipe_demo["sales_transformed_inverted"] = target_pipe.inverse_transform(target_pipe_demo.sales_transformed)
# Adjust for interleaved prediction output later (regressors do not preserve group order)
target_pipe["groupstandardscaler"].inverse_sorted_by_group=False
target_pipe_demo
| series_id | sales | sales_scaled | sales_scaled_inverted | sales_transformed | sales_transformed_inverted | |
|---|---|---|---|---|---|---|
| 0 | 1_AUTOMOTIVE | 0.000 | -1.171285 | 0.000 | -1.813683 | 0.000 |
| 1 | 1_AUTOMOTIVE | 2.000 | -0.446297 | 2.000 | -0.193629 | 2.000 |
| 2 | 1_AUTOMOTIVE | 3.000 | -0.083802 | 3.000 | 0.230597 | 3.000 |
| 3 | 1_AUTOMOTIVE | 3.000 | -0.083802 | 3.000 | 0.230597 | 3.000 |
| 4 | 1_AUTOMOTIVE | 5.000 | 0.641186 | 5.000 | 0.828511 | 5.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 3007997 | 9_SEAFOOD | 11.000 | -0.617762 | 11.000 | -0.363945 | 11.000 |
| 3007998 | 9_SEAFOOD | 21.916 | 0.539653 | 21.916 | 0.655984 | 21.916 |
| 3007999 | 9_SEAFOOD | 19.909 | 0.326852 | 19.909 | 0.511482 | 19.909 |
| 3008000 | 9_SEAFOOD | 12.000 | -0.511733 | 12.000 | -0.237752 | 12.000 |
| 3008001 | 9_SEAFOOD | 19.316 | 0.263977 | 19.316 | 0.466123 | 19.316 |
2983068 rows × 6 columns
As expected, the sales_transformed column—created by applying our pipeline—produces different values than sales_scaled, where we did not apply log1p before scaling. The sales_transformed_inverted column correctly returns the original sales values. This shows that our pipeline works reliably, and we can move on with exploring the data.
Since it is impossible to visualize all 1,782 time series at once or to experiment with different model architectures on such a large dataset locally, we will draw a sample in the next section.
Clustering the Time Series by Shape ↑¶
Preprocessing for Clustering ↑¶
For sampling, we will cluster the time series by their typical shapes and select the series closest to the average shape of each cluster as its representative. To do this, we first need to uncover the relevant shape characteristics for clustering.
We extract these shapes by decomposing each time series into trend, seasonality, and residual components using STL decomposition. Below is an example from the time series of store number 1 and the product family “Automotive”, where we assume a weekly seasonality as the dominant pattern. We observe a bumpy but steady global trend, a strong recurring weekly pattern with changes in amplitude over time, and residuals whose spikes highlight occasional outliers—extreme values that cannot be explained by the other components:
from statsmodels.tsa.seasonal import STL
auto_1 = train.loc[train.series_id == "1_AUTOMOTIVE"].set_index("date")["sales"]
stl = STL(auto_1,
period=7, # assuming a weekly seasonality, we could also set it to 30 for monthly or 365 for yearly seasonality
trend=101, # defines smoothing of the trend component (we are only interested in large patterns)
seasonal=101 # defines smoothing of the seasonality component
);
res = stl.fit()
fig = res.plot()
fig.set_size_inches(20, 10)
for ax in fig.axes:
if ax.get_title() == "sales" or ax.get_ylabel() == "Season":
for line in ax.lines:
line.set_linewidth(0.8) # make the lines of the seasonal and the raw values finer for better distinction
elif ax.get_ylabel() == "Resid":
for line in ax.lines:
line.set_markersize(3) # reduce the dot size for better distinction
from statsmodels.tsa.stattools import pacf
pacf_auto_1 = pacf(auto_1, nlags=370, method="ywm")[1:]
z = 1.96
plt.figure(figsize=(16, 5));
plt.bar(range(1, 371), pacf_auto_1)
plt.xticks(range(0, 371, 7));
plt.xlim(0, 370);
plt.hlines(z / auto_1.shape[0]**0.5, xmin=0, xmax=370, color="r", linestyle="--");
plt.hlines(-z / auto_1.shape[0]**0.5, xmin=0, xmax=370, color="r", linestyle="--");
The bar plot above shows the PACF of this example series for lags 1 to 370, together with a heuristic significance band (red dashed lines at ±1.96/√n). By computing the PACF for lags 0 to 370 across all time series, we obtain a distribution of PACF values for each lag, which we can visualize to identify lags with consistently high absolute values across series. Although this step takes a few minutes to compute, it allows us to avoid sampling and instead detect dominant seasonal patterns across the full dataset, and it helps us identify clusters of similar time series for further analysis and modeling in the following steps.
We start by computing the PACF values across all groupwise scaled time series, since we focus on their shape rather than their scale. Because the computation with statsmodels for many series is very slow, I compute the PACF values for a single series using the established Levinson–Durbin algorithm and then iterate over all series, accelerating the computation with numba. This reduces the runtime to a few seconds instead of roughly 10 minutes with statsmodels.
import numpy as np
from numba import njit
series_ids = train["series_id"].unique()
sales_scaled_2D = train["sales_scaled"].values.reshape(len(series_ids), -1)
@njit
def pacf_levinson(series_matrix, n_series=num_series, nlags=370):
pacfs = np.zeros((n_series, nlags+1))
skipped_mask = np.zeros(n_series, dtype=np.bool_)
for i, x in enumerate(series_matrix):
skipped = False
# check for constant zero series
if x.std() < 1e-6:
skipped = True
pacf = np.full(nlags + 1, np.nan)
pacfs[i], skipped_mask[i] = pacf, skipped
continue
n = len(x)
x = x - np.mean(x)
# compute autocorrelation first
acf = np.zeros(nlags + 1)
den = np.dot(x, x) # equals np.sum(x**2) <=> n * VAR(x), same for all lags
for k in range(nlags + 1):
num = 0.0
for t in range(k, n):
num += x[t] * x[t - k] # equals n * COV(x(t), x(t-k))
acf[k] = num / den
pacf = np.zeros(nlags + 1)
pacf[0] = 1.0 # first pacf always equals 1
phi = np.zeros((nlags + 1, nlags + 1)) # phi are the coefficients of AR process with lags k=n_lags
sigma = acf[0] # equals 1
# Levinson-Durbin algorithm solver:
for k in range(1, nlags+1):
num = acf[k]
for j in range(1, k):
num -= phi[k - 1][j] * acf[k - j]
phi[k][k] = num / sigma
# update phi and sigma for next step
for j in range(1, k):
phi[k][j] = phi[k-1][j] - phi[k][k] * phi[k-1][k-j]
sigma *= (1 - phi[k][k] ** 2)
# extract pacf(k) as kth coefficient of AR(k) process
pacf[k] = phi[k][k]
pacfs[i], skipped_mask[i] = pacf, skipped
return pacfs, skipped_mask
pacfs, skipped_mask = pacf_levinson(sales_scaled_2D)
pacfs
array([[ 1. , 0.09900266, 0.03339383, ..., -0.02540045,
-0.02078068, 0.05298405],
[ nan, nan, nan, ..., nan,
nan, nan],
[ 1. , 0.13940509, 0.1054958 , ..., -0.02811024,
0.00377025, -0.00468863],
...,
[ 1. , 0.78855251, 0.40961657, ..., 0.02550885,
0.05120047, -0.0153876 ],
[ 1. , 0.59355908, 0.20642022, ..., -0.00616862,
0.0400932 , 0.02374785],
[ 1. , 0.30732496, 0.04322177, ..., 0.0160026 ,
-0.00708538, -0.03475395]], shape=(1782, 371))
Since I received errors when running a simple loop, I added a mask array skipped_mask, which flags series with near-zero standard deviation, and assigned NaN values for these series in the PACF array. Let us take a quick look at the values of these series. The hypothesis is that they are constant zero:
train.loc[train.series_id.isin(series_ids[skipped_mask]), "sales"].sum()
np.float64(0.0)
Most of these series belong to the product families Baby Care and Books, with a few from Lawn and Garden or Ladieswear:
series_ids[skipped_mask]
array(['1_BABY CARE', '10_BOOKS', '11_BOOKS', '12_BOOKS', '13_BABY CARE',
'13_BOOKS', '14_BOOKS', '14_LAWN AND GARDEN', '15_BOOKS',
'16_BOOKS', '16_LADIESWEAR', '17_BOOKS', '18_BOOKS', '19_BOOKS',
'20_BOOKS', '21_BOOKS', '22_BOOKS', '23_BABY CARE',
'25_LADIESWEAR', '28_BOOKS', '28_LADIESWEAR', '29_BOOKS',
'29_LADIESWEAR', '30_BOOKS', '30_LAWN AND GARDEN', '31_BOOKS',
'32_BOOKS', '32_LADIESWEAR', '33_BOOKS', '33_LADIESWEAR',
'34_BOOKS', '35_BOOKS', '35_LADIESWEAR', '36_BOOKS', '39_BOOKS',
'40_BOOKS', '40_LADIESWEAR', '43_BOOKS', '43_LADIESWEAR',
'44_BABY CARE', '45_BABY CARE', '46_BABY CARE', '47_BABY CARE',
'48_BABY CARE', '49_BABY CARE', '50_BABY CARE', '51_BABY CARE',
'52_BABY CARE', '52_BOOKS', '54_BOOKS', '54_LADIESWEAR',
'54_LAWN AND GARDEN', '9_BOOKS'], dtype=object)
# find the IDs of zero-only series through checking the data directly
zero_ids = (train.groupby("series_id", sort=False)["sales"]
.apply(lambda s: s.eq(0).all())
.pipe(lambda s: s.index[s].to_numpy()))
nonzero_series_idx = ~np.isin(series_ids, zero_ids)
# save the IDs of series that are not constantly zero
nonzero_series_ids = series_ids[nonzero_series_idx]
# check if the two differently yielded arrays with zero-only series IDs contain the same elements
np.array_equal(zero_ids, series_ids[skipped_mask])
True
Since the sum over all these time series equals zero (and sales are non-negative), the only value these flagged series ever take is zero. This suggests that certain store and product family combinations do not exist in practice—i.e., some stores do not sell specific product families.
We will examine these zero-only series in more detail later. For now, we simply save the IDs of the constant-zero series and of the non-constant-zero series for later use, so we do not need to rerun the resource-intensive PACF computation across all series. The check above confirms that the series flagged by skipped_mask are indeed the only constant-zero series in the DataFrame.
Now we can visualize the distributions of PACF values. While the classic approach would be to use boxplots to compare distributions, this quickly became too crowded and noisy due to the large number of lags and the many outliers. Therefore, I decided to simplify the visualization by plotting only the median (shown as a horizontal line) and the interquartile range (IQR) (represented as a box spanning from the first to the third quartile). Notice that I did not plot the distribution for lag 0, as it is constantly 1 and would distort the plot.
When interpreting PACF values for a single time series, it is important to note that not every value different from zero reflects a meaningful pattern. To address this, we typically compute a confidence interval that indicates the threshold beyond which PACF values can be considered statistically significant. However, since we are not looking at a single time series here, we can use the confidence bands in the plot (shown as dotted red lines) heuristically rather than as strict thresholds — more as a guide to ask: is this value high enough to be considered meaningful?
Click here for more explanation below if you are curious:
The red dotted lines in the plot serve as a rough reference for which PACF values can be considered “large enough” to possibly reflect relevant effects. They are based on the classic confidence band used for single time series (±1.96 / √n), but I do not treat them as strict significance thresholds here. Since we are not testing each individual PACF value for each series, but rather looking at the distribution of PACF values per lag across more than 1700 time series, the usual multiple comparisons logic does not apply directly. Instead, we can interpret these bands more heuristically: if the median or even a larger part of the distribution consistently lies outside the band, that may point to a real underlying structure across many series, rather than just random noise. In this way, we are not using a formal statistical test, but rather reasoning in the spirit of empirical Bayesian thinking: if many independent time series show a similar effect at the same lag, it is less likely to be a coincidence — even if the values themselves are not individually “significant” in the strict sense.
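To put a rough number on this "many series agreeing" intuition, assume (counterfactually) independence across series and a generous 5% chance per series of exceeding the band under the null; the probability that the *median* of the roughly 1,729 non-constant series lands outside the band is then bounded by a binomial tail. A sketch of that arithmetic:

```python
from math import lgamma, log

n = 1729        # assumed number of non-constant series (1782 minus 53 zero-only)
k = n // 2 + 1  # "the median exceeds the band" means at least half the series do
p = 0.05        # per-series exceedance probability under the null (generous)

# log10 of the tail's leading binomial term P(X = k); the full tail P(X >= k)
# is only slightly larger, so this is a tight estimate of its order of magnitude
log10_p = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
           + k * log(p) + (n - k) * log(1 - p)) / log(10)
print(round(log10_p))  # on the order of -600: pure chance is effectively ruled out
```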
# create the dataframe
quantiles = {f"q_{quantile}": np.nanquantile(pacfs, quantile/100, axis=0) for quantile in [25, 50, 75]}
pacfs_quant = (pd.DataFrame(quantiles).reset_index(names="Lag")
.loc[lambda df: df.Lag!=0]) # lags of order 0 are always 1
# create the plot
fig, ax = plt.subplots(figsize=(20, 5))
# draw the median
plt.hlines(y=pacfs_quant["q_50"], xmin=pacfs_quant["Lag"]-0.5, xmax=pacfs_quant["Lag"]+0.5);
# specify the rectangle parameters and draw them
rectangle_x = pacfs_quant["Lag"].values - 0.5
rectangle_y = pacfs_quant["q_25"].values
rectangle_height = (pacfs_quant["q_75"] - pacfs_quant["q_25"]).values
for i in range(len(pacfs_quant["Lag"])):
rectangle = plt.Rectangle(xy=(rectangle_x[i], rectangle_y[i]), height=rectangle_height[i], width=1, color='blue', alpha=0.3)
ax.add_patch(rectangle)
# add the "confidence" band
z=1.96
series_window = sales_scaled_2D.shape[1] # length of each series
plt.hlines(z/series_window**0.5, xmin=0, xmax=370, color="r", linestyle="--");
plt.hlines(-z/series_window**0.5, xmin=0, xmax=370, color="r", linestyle="--");
# beautify the axes
plt.xticks(ticks=range(0, 375, 5), labels=range(0, 375, 5))
plt.tick_params(axis="x", labelrotation=45)
plt.xlim(0, 370)
plt.xlabel("Lag")
plt.ylim(-0.2, 0.7);
plt.ylabel("PACF Value")
plt.title("Medians and IQRs of Observed PACF Values over All Time Series by Lags", size=14, pad=10);
The PACF distributions show a strong trend component (very high values for lag 1 and still elevated medians for lags 2 to 6) and clear weekly seasonality (lag 7), with repeating peaks at every 7th lag (e.g. 14, 21, 28) that slowly fade out. These recurring patterns likely reflect residual seasonality that the PACF could not fully filter out. We find no evidence for monthly seasonality, as lags 29–31 remain near zero. A weak yearly signal appears at lag 365, where the upper quartile slightly exceeds the confidence band.
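This reading can be cross-checked programmatically by listing the lags whose median PACF exceeds the heuristic band. The sketch below uses a synthetic stand-in for the pacfs array (shapes and effect sizes are illustrative only; the series length of 1,674 days matches the training frame):

```python
import numpy as np

rng = np.random.default_rng(15)
n_series, n_lags, series_len = 50, 60, 1674

# synthetic stand-in: noise plus elevated PACF values at lag 1
# and at multiples of 7, mimicking the weekly structure in the plot
pacfs_demo = rng.normal(0, 0.01, size=(n_series, n_lags + 1))
pacfs_demo[:, 1] += 0.5
pacfs_demo[:, 7::7] += 0.15

band = 1.96 / np.sqrt(series_len)                  # heuristic confidence band
medians = np.nanmedian(pacfs_demo, axis=0)
notable = np.flatnonzero(medians[1:] > band) + 1   # skip lag 0 (always 1)
print(notable)  # lag 1 and the weekly multiples stand out
```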
Overall, the most relevant components are trend, weekly seasonality, and possibly yearly seasonality. We will use these components to cluster the time series for further analysis, to draw a sample for faster experimentation, and to use the cluster memberships as features when building models.
from joblib import Parallel, delayed
from statsmodels.tsa.seasonal import STL
def extract_components(series):
weekly = STL(series, period=7, trend=101, seasonal=101).fit()
yearly = STL(series, period=365, seasonal=1095).fit()
return weekly.trend, weekly.seasonal, yearly.seasonal
results = Parallel(n_jobs=-1, backend="loky")(delayed(extract_components)(series)
for series in sales_scaled_2D)
trends, weekly_seasonals, yearly_seasonals = map(
np.array, zip(*results)
)
Now we can feed each of the three resulting components into a short pipeline that first scales them column-wise to prevent individual time steps from dominating, and then reduces the number of dimensions to 20 using principal component analysis.
Before that, however, we exclude the constant-zero series from these components, as they show no shape at all and pull the centroid of their cluster towards them—being, in effect, 53 identical points in our 60-dimensional feature space. This does not just theoretically bias clustering algorithms—in practice, I observed that it noticeably distorted the cluster centers.
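The index `nonzero_series_idx` used below is assumed to have been defined earlier; it could be derived along these lines (a sketch on a toy array):

```python
import numpy as np

# toy stand-in for sales_scaled_2D: rows are series
toy = np.array([[0.0, 0.0, 0.0],   # constant-zero series
                [1.0, 2.0, 3.0],
                [0.0, 1.0, 0.0]])
# indices of series with at least one nonzero value
nonzero_series_idx = np.flatnonzero(toy.any(axis=1))
# nonzero_series_idx -> array([1, 2])
```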
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# removing the constant zero series from all components
feature_sets = [feature_set[nonzero_series_idx] for feature_set in (trends, weekly_seasonals, yearly_seasonals)]
feature_sets_names = ["trends", "weekly_seasonals", "yearly_seasonals"]
def series_dim_reduction(feature_sets: list, feature_sets_names: list,
n_components: int=20) -> dict[str, tuple[np.ndarray, Pipeline]]:
reduced_feature_sets = {}
for feature_set, feature_set_name in zip(feature_sets, feature_sets_names):
pipe = make_pipeline(StandardScaler(), # to ensure that single time steps cannot dominate the reduction
PCA(n_components=n_components, random_state=15))
reduced_feature_set = pipe.fit_transform(feature_set)
reduced_feature_sets[feature_set_name] = (reduced_feature_set, pipe)
return reduced_feature_sets
reduced_feature_sets = series_dim_reduction(feature_sets, feature_sets_names)
reduced_feature_sets
{'trends': (array([[ 1.75935388, -3.56319347, -1.8587173 , ..., 1.80596606,
-2.24445468, 2.83272568],
[ 8.28599543, -5.20646674, -3.29408021, ..., 1.1184677 ,
-1.63918954, 8.37268102],
[-30.52726381, 10.72096921, -9.83782329, ..., -1.12365861,
-1.84447455, 1.87581823],
...,
[-40.55121518, 13.44278419, -17.62093765, ..., 0.29359683,
0.94071306, -1.68565199],
[ 13.26523298, 4.63033647, -10.07421773, ..., -8.44863793,
-3.59472209, 0.20756883],
[ 21.08111145, -1.63819504, -13.9800955 , ..., 1.00223069,
-11.24050945, -4.83712189]], shape=(1729, 20)),
Pipeline(steps=[('standardscaler', StandardScaler()),
('pca', PCA(n_components=20, random_state=15))])),
'weekly_seasonals': (array([[-40.87698583, 10.19729599, 3.2106303 , ..., 1.89271998,
-1.94015163, 3.83016826],
[-39.2964498 , 1.33500114, 3.89545495, ..., -7.38300291,
3.03500479, -1.6964752 ],
[-48.6093754 , 17.60226189, 14.23478379, ..., 0.96256253,
0.77721849, -0.83436167],
...,
[-14.4052285 , -21.33849141, -11.33701398, ..., 2.02205253,
0.07618969, 0.3241803 ],
[-17.56169962, -5.70953249, 1.83048987, ..., 1.37156162,
4.82784113, -1.30373095],
[ 23.54791491, -13.33173915, -22.85770395, ..., 0.55242657,
-4.38661542, -3.29899295]], shape=(1729, 20)),
Pipeline(steps=[('standardscaler', StandardScaler()),
('pca', PCA(n_components=20, random_state=15))])),
'yearly_seasonals': (array([[-20.7387107 , -5.53595923, 9.66375345, ..., -6.33809816,
3.68080874, 0.79906162],
[-17.07148599, -9.46890975, 1.96627217, ..., -4.01390152,
6.72045116, 4.32326383],
[-25.73315842, -8.51161013, -4.84910266, ..., -0.90476867,
0.11219904, 1.63149798],
...,
[ -0.78088231, -9.72589427, -17.33446126, ..., 0.49766029,
0.64740599, 2.09169203],
[ -9.90222705, -11.25313425, -6.98107547, ..., -3.28199122,
-1.28400978, -1.26485263],
[ 14.65049255, 2.03545164, 7.011752 , ..., 6.86856398,
3.00710741, -5.1388154 ]], shape=(1729, 20)),
Pipeline(steps=[('standardscaler', StandardScaler()),
('pca', PCA(n_components=20, random_state=15))]))}
It is always a good idea to take a brief look at the proportion of explained variance captured by the reduced number of dimensions:
for feature_set_name in feature_sets_names:
# extract the explained variance stored in the pca object of the corresponding pipeline
ex_var = reduced_feature_sets[feature_set_name][1]["pca"].explained_variance_ratio_
print(f"Percentage of explained variance in {feature_set_name}: {np.cumsum(ex_var)[-1]*100: .1f}%")
Percentage of explained variance in trends: 96.9%
Percentage of explained variance in weekly_seasonals: 97.3%
Percentage of explained variance in yearly_seasonals: 42.3%
The first 20 principal components already explain 97% of the variance in the trend (i.e., the sum of the 20 largest eigenvalues of the covariance matrix of the STL features, in the language of linear algebra), as well as the same percentage in the weekly seasonality. In the case of the yearly seasonality, they explain only around 43%.
This was expected, however: the PACF plot already showed that the yearly seasonality is not nearly as strong as the trend or the weekly seasonality, so there simply is no comparably clear structure to reduce. Still, the 43% demonstrate that some structure is present and can be captured. We could of course reduce the yearly seasonality to more than 20 dimensions, but reaching a comparable share of explained variance would require many more components, since the 20th principal component explains only around 0.5% of the variance in the yearly seasonality. Including more components would also risk biasing the k-Means clustering toward the yearly seasonality in the next step.
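To make this trade-off concrete, here is a small sketch (with assumed, purely illustrative variance ratios) of how many components a given variance threshold would demand:

```python
import numpy as np

# toy explained-variance ratios, as a PCA might report them
ex_var = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
threshold = 0.9

cum_var = np.cumsum(ex_var)
# smallest number of components whose cumulative ratio reaches the threshold
n_needed = int(np.argmax(cum_var >= threshold - 1e-9)) + 1
# n_needed -> 3
```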
Finding the Appropriate Number of Clusters ↑¶
One challenge with K-Means is that the number of clusters must be defined in advance. To determine a suitable value, we fitted K-Means on the reduced feature set with the number of clusters ranging from 2 to 99 and evaluated the results using the mean silhouette coefficient (also called silhouette score), which ranges from -1 to 1. Higher scores indicate more coherent and better-separated clusters, with fewer instances ambiguously assigned.
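For intuition, the silhouette coefficient of a single point is s = (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b its mean distance to the nearest other cluster. A toy check with two tight, well-separated clusters yields a score close to 1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# two compact 1D clusters far apart
X = np.array([[0.0], [0.1], [10.0], [10.1]])
labels = np.array([0, 0, 1, 1])
score = silhouette_score(X, labels)
# score is ~0.99: nearly perfect separation
```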
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# concatenate the series' reduced components
series_shapes = np.concatenate([value[0] for value in reduced_feature_sets.values()], axis=1)
silhouette_scores = []
max_cluster_range = range(2, 100)
for k in max_cluster_range:
kmeans = KMeans(n_clusters=k, random_state=14)
cluster_labels_ = kmeans.fit_predict(series_shapes)
score = silhouette_score(series_shapes, cluster_labels_)
silhouette_scores.append(score)
silhouettes = pd.DataFrame({"Number of Clusters": max_cluster_range, "Silhouette Score": silhouette_scores})
plt.figure(figsize=(16, 5));
sns.lineplot(silhouettes, x="Number of Clusters", y="Silhouette Score");
plt.xlim(0, max_cluster_range.stop);
xticks = range(0, max_cluster_range.stop, 5)
plt.xticks(xticks, xticks);
Although the silhouette scores are modest (maximum ≈ 0.18), this is typical for time series clustering over decomposed STL components—especially in the presence of a dominant trend and only partial yearly structure. Since the scores for models with 2 and 7 clusters are relatively close (the y-axis covers only a narrow range, which magnifies small differences), we also examine silhouette diagrams, where the instances within each cluster are sorted by their individual silhouette coefficient and stacked. Additionally, we plot the mean silhouette score (shown as a vertical grey dotted line), since it is desirable that as many instances as possible within each cluster show high coefficients.
from sklearn.metrics import silhouette_samples
from sklearn.metrics import silhouette_score
import matplotlib.cm as cm
sns.set_style("whitegrid")
range_n_clusters = range(2, 8)
fig, axs = plt.subplots(len(range_n_clusters)//2, 2, figsize=(16, len(range_n_clusters)*2))
fig.tight_layout()
for n_cluster, ax in zip(range_n_clusters, axs.flatten()):
kmeans = KMeans(n_cluster, random_state=14)
cluster_labels = kmeans.fit_predict(series_shapes)
silhouette_avg = silhouette_score(series_shapes, cluster_labels)
sample_silhouette_values = silhouette_samples(series_shapes, cluster_labels)
y_lower = 10
y_ticks = []
for i in range(n_cluster):
ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_cluster)
ax.fill_betweenx(
np.arange(y_lower, y_upper),
0,
ith_cluster_silhouette_values,
facecolor=color,
edgecolor=color,
alpha=0.7,
)
y_ticks.append((y_upper - y_lower)/2 + y_lower)
y_lower = y_upper + 50
ax.set_title(f"{n_cluster} Clusters");
ax.set_yticks(y_ticks, range(0, n_cluster))
ax.set_ylabel("Cluster")
ax.set_xlabel("Silhouette Coefficient")
ax.vlines(silhouette_avg, -100, 3000, color="grey", linestyle="dotted")
ax.set_ylim(-100, y_upper+100) # we can reuse the highest y_upper from the previous loop
ax.set_xlim(-0.15, 0.44)
ax.grid(visible=False)
plt.subplots_adjust(hspace=0.4)
We see that models with 2 and 7 clusters are not optimal, as both include clusters where many instances have coefficients below the average. The models with 3 and 4 clusters both show higher mean silhouette scores, and we would normally prefer a clustering with more balanced cluster sizes (as seen in the 3-cluster model). While having lower mean silhouette scores (indicated by the vertical grey dotted line), the models with 5 and 6 clusters still produce acceptable diagrams.
However, from these plots it is still difficult to determine which of the models with 3 to 6 clusters best captures the underlying structure. To gain a better understanding of the shape data, we project the clustered time series shapes along with their centroids (marked as bold Xs) into two dimensions using UMAP and examine the structure further.
from umap import UMAP
n_clusters = range(3, 7)
# map series shapes in 2D:
umap = UMAP(n_neighbors=30,
n_components=2, # we want a 2D output
random_state=14,
n_jobs=1) # required if a random seed was set
reduced_shapes = umap.fit_transform(series_shapes)
shape_clusters = []
centroids_list = []
for n_cluster in n_clusters:
kmeans = KMeans(n_cluster, random_state=14)
cluster_labels = kmeans.fit_predict(series_shapes)
shape_clusters.append(
pd.DataFrame({"series_id": series_ids[nonzero_series_idx],
"Cluster": cluster_labels,
"Number of Clusters": n_cluster,
"UMAP 1": reduced_shapes[:, 0],
"UMAP 2": reduced_shapes[:, 1]})
)
centroids_list.append(umap.transform(kmeans.cluster_centers_))
shape_clusters_df = pd.concat(shape_clusters, axis=0)
g = sns.relplot(shape_clusters_df, x="UMAP 1", y="UMAP 2", hue="Cluster", col="Number of Clusters", col_wrap=2,
alpha=0.5, height=4, aspect=2, palette="viridis")
for centroids, ax in zip(centroids_list, g.axes.flatten()):
ax.scatter(x=centroids[:, 0], y=centroids[:, 1], marker="o", s=80, color="lightgrey", edgecolors="face", alpha=0.8);
ax.scatter(x=centroids[:, 0], y=centroids[:, 1], marker="x", s=50, color="black", linewidths=1.5);
plt.suptitle("UMAP Projection by Cluster");
plt.subplots_adjust(top=0.9);
When we look at the plots for the models with 3 and 4 clusters, which also have the highest mean silhouette scores, they appear rather coarse: the 3-cluster model assigns the clearly separated structure at the top to cluster 2 and both models fail to divide the lower structure which is visibly separable into two parts. As a result, the centroid for this cluster is pulled toward the denser right side.
Interestingly, the model with 5 clusters also does not split the lower structure but instead introduces a new cluster in the center. At first, I thought this central cluster might be an artefact—possibly due to cluster 2 of the 4-cluster model being arbitrarily split in two. However, when projecting the data using t-SNE instead of UMAP, this region appears more distinct (though t-SNE does not support projecting centroids). Moreover, when using Gaussian mixture models instead of k-means the central cluster is also detected in the 5- and 6-cluster models.
Only the 6-cluster model successfully separates the lower structure into a left and right part and shifts the central cluster further into the data's core, resulting in centroids that lie close to the visual center of each group. Furthermore, a Gaussian mixture model with 6 clusters yields a very similar result, which suggests that the k-means clustering is not an artefact.
I also experimented with density based clustering using HDBSCAN, but it either produced a few very small clusters and one dominant one, or only a single cluster. This suggests the density contrast in the data is too low. For our goal of finding representative time series, k-means remains superior to Gaussian mixture models, as its centroids can be directly interpreted as average shapes in our STL feature space, and it provides clear group boundaries through hard assignments.
Since silhouette scores are only one way to assess clustering performance, I also computed the mean intra-cluster distance and the Dunn index for k-means models with 3 to 6 clusters for additional robustness. The mean intra-cluster distance calculates the average distance between all data points within the same cluster and then takes the mean of these values across all clusters. It reflects the compactness of a clustering: lower values indicate that points within each cluster are close to one another. The Dunn index additionally considers separation between clusters by dividing the smallest inter-cluster distance (i.e., the distance between two different clusters) by the largest intra-cluster distance. Higher Dunn index values suggest well-separated and compact clusters.
from sklearn.metrics import pairwise_distances
import numpy as np
from scipy.spatial.distance import cdist
def mean_intra_cluster_distance(X, labels):
clusters = np.unique(labels)
intra_dists = []
for c in clusters:
if c == -1:
continue # skip if assigned -1 for constant-zero series
cluster_points = X[labels == c]
if len(cluster_points) <= 1: # we cannot calculate the pairwise distance of a single point
continue
dists = pairwise_distances(cluster_points)
mean_dist = np.sum(dists) / (len(cluster_points) * (len(cluster_points) - 1)) # n*(n-1) ordered pairs, excluding the zero diagonal
intra_dists.append(mean_dist)
return np.mean(intra_dists)
def dunn_index(X, labels):
clusters = np.unique(labels)
clusters = clusters[clusters != -1] # optionally exclude -1
intra_dists = []
inter_dists = []
for i in clusters:
Xi = X[labels == i]
if len(Xi) <= 1:
continue
intra = np.max(pairwise_distances(Xi))
intra_dists.append(intra)
for j in clusters:
if i >= j:
continue
Xj = X[labels == j]
dist = np.min(cdist(Xi, Xj))
inter_dists.append(dist)
return np.min(inter_dists) / np.max(intra_dists)
n_clusters = range(3, 7)
mean_intra_cluster_dists = {}
for n_cluster in n_clusters:
kmeans = KMeans(n_cluster, random_state=14)
cluster_labels = kmeans.fit_predict(series_shapes)
mean_intra_cluster_dists[f"{n_cluster}_cluster"] = mean_intra_cluster_distance(series_shapes, cluster_labels)
dunn_indices = {}
for n_cluster in n_clusters:
kmeans = KMeans(n_cluster, random_state=14)
cluster_labels = kmeans.fit_predict(series_shapes)
dunn_indices[f"{n_cluster}_cluster"] = dunn_index(series_shapes, cluster_labels)
fig, axes = plt.subplots(1, 2, figsize=(16, 4), constrained_layout=True)
ks = [str(n_cluster) for n_cluster in n_clusters]
intra_vals = list(mean_intra_cluster_dists.values())
dunn_vals = list(dunn_indices.values())
# Mean intra-cluster distance (lower is better)
ax = axes[0]
bars = ax.bar(ks, intra_vals, color='#3b528b', edgecolor='none')
ax.set_title('Mean Intra-cluster Distance (↓ better)')
ax.set_xlabel('Number of Clusters'); ax.set_ylabel('Distance');
# highlight k=6
bars[ks.index("6")].set_color('#2c3278'); bars[ks.index("6")].set_edgecolor('black')
# Dunn Index (higher is better)
ax = axes[1]
bars = ax.bar(ks, dunn_vals, color="#25848e", edgecolor='none')
ax.set_title('Dunn Index (↑ better)')
ax.set_xlabel('Number Of Clusters'); ax.set_ylabel('Index');
bars[ks.index("6")].set_color('#2a788e'); bars[ks.index("6")].set_edgecolor('black')
The 6-cluster model achieves the best (lowest) mean intra-cluster distances. However, this metric is not agnostic to the number of clusters and tends to decrease as the number of clusters increases—since additional clusters naturally lead to smaller within-cluster distances.
The Dunn index yields the same value for 3-, 4- and 6-cluster models, indicating that these three solutions are similarly compact and well-separated.
Taking this into account—along with the visual inspection of the clustering projected into two dimensions with UMAP and the subsequent examination of the average shapes and distributions of product families and stores per cluster—I conclude that the 6-cluster model, despite its lower mean silhouette score, provides the best overall fit.
When we plot each cluster's average shape along with its individual time series, we find clearly distinct patterns across clusters, further supporting this choice:
from matplotlib.collections import LineCollection
from matplotlib.lines import Line2D
import matplotlib.dates as mdates
# KMeans clustering like before
kmeans = KMeans(6, random_state=14)
cluster_labels = kmeans.fit_predict(series_shapes)
shape_clusters = pd.DataFrame({
"series_id": nonzero_series_ids,
"Cluster": cluster_labels
})
# add cluster labels to train
clusters_dict = dict(zip(shape_clusters.series_id, shape_clusters.Cluster))
train["shape_cluster"] = train.series_id.map(clusters_dict)
# Pivot: rows = date, columns = series_id and select only nonzero series
pivoted = train.pivot(index="date", columns="series_id", values="sales_scaled")[nonzero_series_ids]
date_nums = mdates.date2num(pivoted.index) # numerical representation aligns with numpy floats
# Compute mean per cluster in a vectorized way
cluster_map = shape_clusters.set_index("series_id")["Cluster"]
mean_by_cluster = pivoted.T.groupby(cluster_map).mean().T # transpose -> mean per cluster -> transpose back
# columns of mean_by_cluster are cluster labels (0:5)
sorted_clusters = sorted(np.unique(cluster_labels))
# create empty facet grid
g = sns.FacetGrid(train, col="shape_cluster", col_wrap=1, height=3, aspect=6)
# fill the axes
for ax, c in zip(g.axes.flatten(), sorted_clusters):
series_ids_in_c = shape_clusters.loc[shape_clusters.Cluster == c, "series_id"]
# matrix of series in this cluster:
data = pivoted[series_ids_in_c].values # shape (T, n_series)
# draw all individual series as one LineCollection for massive speed gain
lines = [np.column_stack([date_nums, data[:, j]])
for j in range(data.shape[1])]
lc = LineCollection(lines, colors="blue", alpha=0.05)
ax.add_collection(lc)
# add cluster mean
mean_series = mean_by_cluster[c].values
ax.plot(date_nums, mean_series, color="yellow", linewidth=1.5)
ax.set_xlim(date_nums.min(), date_nums.max())
ax.set_ylim(-5, 7.5)
ax.xaxis_date()
ax.set_ylabel("Sales Scaled")
ax.set_title(f"Cluster {c}")
legend_lines = [
Line2D([0], [0], color="blue", alpha=0.2, label="Individual series"),
Line2D([0], [0], color="yellow", label="Cluster mean")
]
g.axes[0].legend(handles=legend_lines, loc="upper left", frameon=True);
Looking at the clusters, we see indeed six distinct shapes:
- Clusters 0 and 1 appear similar at first glance, but differ in key aspects: Cluster 0 shows higher variation, resulting in a wider band of individual series around the mean. In contrast, Cluster 1 has a larger amplitude in its mean and a much narrower band, indicating a more consistent pattern.
- Cluster 2 consists of series with very high variation, which causes the mean to average out close to zero. This is also the cluster to which the constant-zero series were originally assigned before being excluded due to the bias they introduced.
- Cluster 3 has a very characteristic shape: It shows relatively high variation, and its mean shifts several times between two levels. These shifts are accompanied by changes in amplitudes until mid-2015, after which the level stabilizes at a higher value.
- Cluster 4 exhibits a very regular pattern with minimal variation. It is also the only cluster whose mean spikes sharply upwards at the end of each year (we will see why in the next plot).
- Cluster 5 emerged from a split of Cluster 3 in the 4-cluster model. While visually similar to Cluster 3, it shows smaller level shifts and a steady upward trend throughout 2015, followed by slightly higher amplitudes than those seen in Cluster 3.
To further justify the clustering and better understand the structure within each cluster, we now examine the distribution of product families and stores within these found clusters.
def heatmap_sums(df: pd.DataFrame, row: str, col: str, xlabel: str, ylabel: str, title: str, cmap: str,
col_labelrotation: int = 0, annotations: bool = False, show_colsums: bool = True,
show_rowsums: bool = True, show_totalsum: bool = True):
"""
Take a DataFrame and visualize the counts of unique combinations of two of its columns
as a heatmap, annotating the row and column sums in the margins
"""
value_counts_df = (df.drop_duplicates("series_id")[[row, col]]
.value_counts()
.sort_index()
.reset_index()
)
xtab = (pd.crosstab(index = value_counts_df[row],
columns = value_counts_df[col],
values = value_counts_df["count"],
aggfunc = "sum")
.fillna(0)
.astype(int))
# save the margin sums
col_totals = np.sum(xtab, axis=0)
row_totals = np.sum(xtab, axis=1)
# create the main plot
plt.figure(figsize=(17, 6))
ax = sns.heatmap(
xtab,
cmap=cmap,
cbar=False,
linewidths=0.5,
square=True,
linecolor="grey",
annot=annotations,
annot_kws={"size": 7, "color": "grey"}
)
# adjust the axes
ax.tick_params(axis="x", top=True, bottom=False, labeltop=True, labelbottom=False, length=0,
labelrotation=col_labelrotation)
ax.xaxis.set_label_position('top')
plt.xlabel(xlabel, size=12, labelpad=10)
plt.ylabel(ylabel, size=12)
heat_cols = xtab.columns
heat_index = xtab.index
# annotate column totals (bottom of each column)
if show_colsums:
for j, col in enumerate(heat_cols):
total = col_totals[col]
ax.text(j + 0.5, len(heat_index) + 1, f"{total}",
ha="center", va="bottom", fontsize=10, color="black", weight="bold")
ax.text(-0.5, len(heat_index) + 1, "Σ",
ha="center", va="bottom", fontsize=10, color="black", weight="bold")
# annotate row totals (right of each row)
if show_rowsums:
for i, row in enumerate(heat_index):
total = row_totals[row]
ax.text(len(heat_cols) + 1, i + 0.5, f"{total}",
ha="right", va="center", fontsize=10, color="black", weight="bold")
ax.text(len(heat_cols) + 0.8, -0.4, "Σ",
ha="right", va="center", fontsize=10, color="black", weight="bold")
# annotate the overall sum label
if show_totalsum:
ax.text(len(heat_cols) + 1, len(heat_index) + 1, f"{sum(col_totals)}",
ha="right", va="bottom", fontsize=10, weight="bold");
plt.title(title, size=16, pad=20);
heatmap_sums(train,
row="shape_cluster",
col="family",
xlabel="Product Family",
ylabel="Cluster",
title="Count of Time Series by Product Family and Cluster",
cmap="viridis",
col_labelrotation=90,
annotations=True,
show_colsums=True)
unique_clusters = np.unique(cluster_labels) # pd.Series cannot be converted to int due to the NaNs of the zero-only series
plt.yticks(ticks = unique_clusters + 0.5, labels = unique_clusters);
There are clearly different patterns across the clusters:
- Cluster 0 and Cluster 1 share a few product families related to daily foods—such as Bread/Bakery, Dairy, Deli, Eggs, Grocery I, Meats, Poultry and Prepared Foods—as well as daily consumables like Cleaning and Personal Care. This aligns well with their similar average shape. However, these series are more strongly represented by Cluster 0, which also includes Automotive and Beauty—both of which fit with Cleaning and Personal Care as everyday non-food products.
- Cluster 2 overlaps with Cluster 0 in Automotive and Beauty but also contains products that can easily be bought in advance and stored, such as Baby Care, Books, Frozen Food, Hardware, Home Appliances, Lawn and Garden, Lingerie and School and Office Supplies. The only product families that are less represented and may not quite fit this pattern are Meats, Seafood and Grocery II. However, without knowing the distinction between Grocery I and II, it is possible that Grocery II includes non-perishable foods. But remember that this cluster has the highest variation, so patterns here may be strongly influenced by discounts or promotions.
- Cluster 3 seems to focus on products linked to home comfort or social occasions, including Beverages, Celebration, Home and Kitchen I and II, Home Care, Pet Supplies, Players and Electronics and possibly Produce (although this is a bit harder to interpret). Ladieswear is also present here and may relate to personal preparation for such occasions.
- Cluster 4 is dominated by alcoholic drinks (Liquor, Wine, Beer) and also includes some series from Meats—products which might be bought together. One might have expected alcoholic beverages to fall into Cluster 3, but recall Cluster 4 was the only cluster whose mean spikes at the end of each year, which aligns with Christmas and New Year’s Eve celebrations, when alcoholic drinks are commonly purchased.
- Cluster 5, which emerged together with Cluster 3 from a bigger cluster of the 4-cluster model, includes time series from nearly all families except Books. It is characterized by Magazines and to a lesser extent by Home and Kitchen II, Pet Supplies (similar to Cluster 3) and Prepared Foods.
Finally, note that each product family originally occurs 54 times (once in every store). Since we excluded the constant-zero series, we can see now that Baby Care, Books, Ladieswear and Lawn and Garden are the only product families that contain zero-only series, as shown by their lower column sums.
Now let us also take a quick look at the distribution of the stores.
heatmap_sums(train.assign(store_nbr=train.store_nbr.astype(int)), # reassign as integer for correct sorting
row="shape_cluster",
col="store_nbr",
xlabel="Store",
ylabel="Cluster",
title="Count of Time Series by Store Number and Cluster",
cmap="viridis",
col_labelrotation=90,
annotations=True,
show_colsums=True)
plt.yticks(ticks=unique_clusters + 0.5, labels = unique_clusters);
It is impossible to interpret this plot meaningfully at the individual level, since the store number alone does not convey any informative property about the store itself. Nevertheless, it helps to uncover distinct patterns across clusters: while Clusters 0 and 1 included similar product families, this is not the case for the stores, which explains why their shapes looked similar but still differed. Interestingly, the opposite is true for Clusters 2 and 3, which include similar stores but very different product families.
Cluster 4 is represented only a few times in nearly every store, likely due to its strong connection to alcoholic drinks, whereas Cluster 5 is heavily dominated by just six stores—which explains its differing shape compared to Cluster 3, even though both emerged from the same cluster in the 4-cluster model.
Note also that each of the 33 product families is originally offered in every store. However, many stores now show slightly lower counts due to the exclusion of the constant-zero series. In contrast to the product families—where only four included zero-only series—it is much more common for a store to have never sold any items of some product family.
Since we have found characteristic patterns in the distributions of product families and stores for each cluster—and validated the clustering with multiple metrics, silhouette diagrams, UMAP projections, and average shapes—we can now confidently conclude that the 6-cluster model is reasonable and meaningful.
We will now take a closer look at the typical shapes of each cluster by inspecting the STL decompositions of the time series closest to each centroid. This will help us determine whether each cluster is shaped more by trend, weekly, or yearly seasonality, and allow us to derive useful features for our models. We start by identifying the representative time series for each cluster:
# get the series closest to each centroid and their labels
series_dist = kmeans.transform(series_shapes)
representative_idx = np.argmin(series_dist, axis=0)
representative_series_ids = nonzero_series_ids[representative_idx]
representative_labels = kmeans.predict(series_shapes[representative_idx]) # just an array from 0 to 5
# build a mapping dict
representative_dict = dict(zip(representative_series_ids, representative_labels))
# assign the label of the cluster to its representative series in the DataFrame and give the non-representatives another value
train["cluster_rep"] = train.series_id.map(representative_dict).fillna(-2) # -1 is already reserved for the only-zero series
train.loc[train.cluster_rep!=-2, ["series_id", "shape_cluster", "cluster_rep"]].drop_duplicates().sort_values("cluster_rep")
| | series_id | shape_cluster | cluster_rep |
|---|---|---|---|
| 654944 | 2_PERSONAL CARE | 0.0 | 0.0 |
| 1853424 | 4_DELI | 1.0 | 1.0 |
| 511464 | 18_CELEBRATION | 2.0 | 2.0 |
| 643128 | 2_HOME CARE | 3.0 | 3.0 |
| 594176 | 19_LIQUOR,WINE,BEER | 4.0 | 4.0 |
| 2143760 | 44_HOME AND KITCHEN II | 5.0 | 5.0 |
Examining the Clusters ↑¶
We will now examine these representative time series in more detail in order to better understand the dataset and uncover patterns and relationships that are essential for accurate forecasts. As we would expect, they look similar to the average series of each cluster—as plotted above—since k-means computes the centroid of a cluster as the mean along each dimension. The series that are closest to the centroid are therefore also closest to the cluster's average shape, which is why we may interpret them as representative.
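This mean property is easy to verify on toy data (a sketch, not part of the pipeline): after convergence, each k-means centroid equals the per-dimension mean of its members.

```python
import numpy as np
from sklearn.cluster import KMeans

# two well-separated Gaussian blobs in 3D
rng = np.random.default_rng(14)
X = np.vstack([rng.normal(0, 0.5, (20, 3)),
               rng.normal(5, 0.5, (20, 3))])
km = KMeans(n_clusters=2, random_state=14, n_init=10).fit(X)

for c in range(2):
    members = X[km.labels_ == c]
    # centroid == mean of assigned points (up to float tolerance)
    assert np.allclose(km.cluster_centers_[c], members.mean(axis=0))
```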
To illustrate this, we now plot the shadowed raw values of each representative time series along with their STL-decomposed trend, which we are going to inspect next:
component_dfs = []
for i, comp in enumerate(["trend", "weekly", "yearly"]):
base_df = (train.loc[train.cluster_rep!=-2, ["series_id", "date", "sales_scaled", "cluster_rep"]]
.sort_values(["cluster_rep", "date"])
.copy())
component_flat = np.array(feature_sets)[i, representative_idx, :].flatten()
base_df["component"] = comp
base_df["decomposed"] = component_flat
component_dfs.append(base_df.rename(columns={"sales_scaled": "raw"}))
component_df = (pd.concat(component_dfs, axis=0)
.melt(id_vars=["series_id", "date", "cluster_rep", "component"],
var_name="Scaled Sales Version",
value_name="Sales Scaled"))
trend_reps = component_df[component_df.component=="trend"].copy()
g = sns.relplot(trend_reps,
x="date", y="Sales Scaled", row="cluster_rep", height=2, aspect=6, hue="Scaled Sales Version", kind="line");
custom_lines = [
Line2D([0], [0], color='tab:blue', lw=1.5, alpha=0.3, label='Raw Values'),
Line2D([0], [0], color='tab:orange', lw=1.5, alpha=1.0, label='Decomposed Trend')
]
for i, ax in enumerate(g.axes.flatten()):
ax.set_xlabel("Date");
ax.set_ylim((-5, 5));
ax.set_title(f"Representative for Cluster {i}", size=10);
for line in ax.lines:
label = line.get_label()
if label in ["_child0", "raw"]:
line.set_alpha(0.3) # Set transparency for "raw"
# adjust legend
labels = [line.get_label() for line in custom_lines]
g.fig.legend(handles=custom_lines, labels=labels,
loc='upper center',
bbox_to_anchor=(0.5, 1.04), # adjust upward offset
ncol=4, # number of columns for legend items
frameon=False);
# add super title
g.fig.suptitle("Trend of Cluster Representatives along with Their Raw Values", size=16, y=1.08);
In general we can say that the representative time series indeed resemble the cluster means and that three distinct trend patterns emerge when we ignore isolated spikes caused by special events. Remember that these series are groupwise scaled and the unit represents the series-specific standard deviation, so even small differences (e.g. 0.5) are meaningful. Also note that the STL trend component is smoothed to highlight broader structural differences, not short-term fluctuations:
- Series with a slight global upward trend, such as the representatives for Cluster 0 and 1 which represent mostly daily food and a few other daily consumables.
- Series with a stable trend close to the mean level, like Cluster 2 and 4.
- The trend of Cluster 2—which represents easily storable products—is nearly flat, with only a very slight slope.
- The trend of Cluster 4, representing alcoholic beverages, follows a recurring yearly pattern, which remains in the trend component because we decomposed using weekly seasonality only (STL can extract only one seasonal component at a time). Still, we can observe a slight global upward trend here as well.
- Series with a shifted level which start below the mean and transition into a higher state, such as Cluster 3 and 5.
- Cluster 3 shows a more abrupt transition with a clear plateau from 2015 onward.
- Cluster 5, in contrast, exhibits a smoother level shift during 2015.
All cluster representatives show at least a slight global upward trend, which should be considered as a feature when training a model. Plausible causes for this might be a growing or wealthier population.
We continue with the examination of the average weekly seasonality. Since our dataset is very large and the time series long, confidence intervals would be narrow and visually uninformative. I therefore show ±1 standard deviation bands, which illustrate the variability of the data around the mean.
import calendar
from typing import Optional
from matplotlib.axes import Axes
# compute the mean and std for each weekday of each cluster
weekly_reps = component_df[component_df["component"]=="weekly"].copy()
weekdays = list(calendar.day_name)
weekly_reps["Weekday"] = pd.Categorical(weekly_reps.date.dt.day_name(),
categories=weekdays,
ordered=True)
weekly_reps_agg = (weekly_reps.loc[weekly_reps["Scaled Sales Version"]=="decomposed"]
.groupby(["cluster_rep", "Weekday"], observed=True)["Sales Scaled"]
.agg(["mean", "std"])
.reset_index())
train_weekly_agg = train.groupby("weekday", observed=True)["sales_scaled"].agg(["mean", "std"]).reset_index()
# plot the cluster means
g = sns.relplot(weekly_reps_agg, x="Weekday", y="mean", col="cluster_rep", col_wrap=3, kind="line",
label="Cluster Mean");
# Define plotting function
def plot_with_band(data: pd.DataFrame, xvar: str, label_band: str, label_line: str=None, linestyle: str="-",
color_line: str=None, color_band: str=None, linewidth: int=2, alpha: float=0.2,
ax: Optional[Axes] = None, **kwargs):
if ax is None:
ax = plt.gca()
ax.plot(data[xvar], data["mean"], label=label_line, linestyle=linestyle, color=color_line, linewidth=linewidth)
ax.fill_between(data[xvar],
data["mean"] - data["std"],
data["mean"] + data["std"],
alpha=alpha,
color=color_band,
linestyle=linestyle,
label=label_band)
# map the plotting function over each subplot
g.map_dataframe(plot_with_band, xvar="Weekday", label_band="Cluster ±1σ", linewidth=2)
for i, ax in enumerate(g.axes.flat):
plot_with_band(train_weekly_agg, "weekday", label_line="Global Mean", label_band="Global ±1σ", linestyle="--",
color_line="grey", color_band="lightgrey", ax=ax)
ax.set_title(f"Representative for Cluster {i}")
ax.set_xlim((0, 6))
ax.tick_params(axis="x", labelrotation=30)
ax.set_ylabel("Scaled Sales")
# adjust legend
handles, labels = ax.get_legend_handles_labels()
g.fig.legend(handles, labels,
loc='upper center',
bbox_to_anchor=(0.5, 1.04), # adjust upward offset
ncol=4, # number of columns for legend items
frameon=False)
# add a separated super title
g.fig.suptitle("Weekly Seasonality Pattern of Cluster Representatives", size=16, y=1.08);
plt.subplots_adjust(wspace=0.2)
Along with each cluster representative’s mean and standard deviation (blue line and band), the global mean and standard deviation pattern (dashed grey line and band) is also plotted. This allows for direct comparison between the cluster-specific weekly seasonality and the overall pattern.
The global pattern shows two distinct plateaus for weekdays and weekends, with only light variation across weekdays—reflecting that most people have more time for shopping on weekends. As expected, the global variation is higher, mostly encompassing the cluster-level curves (except for part of Sunday in Cluster 1). Noteworthy deviations include:
- Cluster 0 (daily food and consumables): Follows the global trend but with lower sales on Fridays and higher sales on Sundays—likely reflecting increased food purchases on weekends.
- Cluster 1 (similar products, different stores): Shows a distinct weekend-focused pattern, with much higher sales on Saturdays and Sundays, and lower activity during the week. This supports the separation from Cluster 0 and suggests that Cluster 1 may include large-format stores where people shop on weekends for bigger purchases.
- Cluster 2 (non-perishables and non-food items): Displays almost no weekly seasonality, except for a small Sunday increase with higher variance. This fits well with product families such as Automotive and Hardware, where timing is less routine and purchases may require assistance or coincide with promotions.
- Cluster 3 (home comfort, social goods): Follows the global pattern closely, with moderate increases toward the weekend.
- Cluster 4 (alcoholic beverages): Stands out with low Sunday values and a continuous increase through the week, peaking on Saturday—suggesting purchase timing is driven by social consumption and the desire to avoid weekday hangovers. Interestingly, this concern seems to become less relevant from Sunday to Thursday.
- Cluster 5 (magazines and mixed products): Shows a moderate, average weekly seasonality, plausible for a diverse mix of product families.
Overall, the clear differences in weekly patterns—together with the trend structures—underscore that the clustering captured distinct temporal shapes even before factoring in the weaker yearly effects, which we will explore next. Before looking at the cluster representatives, we will first look at the global mean and standard deviation for a more informed comparison with the cluster representatives' pattern. We aggregate over combinations of month and day instead of the day of year, to avoid possible misinterpretations due to the 2016 leap year.
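The month-day keying can be sanity-checked in isolation. A minimal standalone sketch (the date range is chosen for illustration) confirming that grouping by the `"%m-%d"` string yields 366 distinct keys, with `"02-29"` contributed by the 2016 leap year:

```python
import pandas as pd

# group key used for the yearly aggregation: a month-day string instead of
# day-of-year, so "02-29" gets its own slot and all other dates stay aligned
# across leap and non-leap years
days = pd.date_range("2013-01-01", "2016-12-31").strftime("%m-%d")
keys = days.unique()

print(len(keys))        # 366 distinct calendar days
print("02-29" in keys)  # True, contributed by 2016
```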
train["month_day"] = train.date.dt.strftime("%m-%d")
train_yearly_agg = train.groupby("month_day")["sales_scaled"].agg(["mean", "std"]).reset_index().sort_values("month_day")
# define the ticks for the x-axis
month_ticks = np.cumsum([0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30]) # 29 for February, as "02-29" is included
month_labels = list(calendar.month_abbr)[1:]
fig, ax = plt.subplots(figsize=(18, 6))
plot_with_band(train_yearly_agg, "month_day", label_line="Global Mean", label_band="Global ±1σ", linewidth=2)
plt.xticks(ticks=month_ticks, labels=month_labels);
plt.xlabel("Day of Year"); plt.ylabel("Scaled Sales");
plt.xlim(0, 365); # since "02-29" is included we have 366 (0-365) days
plt.ylim(-2.5, 3);
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles, labels,
loc='upper center',
bbox_to_anchor=(0.5, 1.12),
ncol=2,
frameon=False);
plt.title("Yearly Seasonality Pattern across all Time Series", y=1.13);
We observe very low values on New Year's Day and on Christmas Day (25 December) (note: the latter was missing entirely and filled with 0, since all stores are closed). In contrast, sales peak in the week before Christmas and stay especially high between 26 December and New Year's Eve.
Another calendar effect is visible as well: values tend to spike on the first of each month, remain elevated for a few days afterwards, and rise again shortly before the end of the month.
It is important not to confuse these recurring spikes with true monthly seasonality. Recall that our PACF analysis showed values close to zero for lags 29 to 31. This is because months have different lengths and because the variation in the middle of a month is high and centered around the mean, which cancels out correlations. The apparent recurring pattern within each month is in fact an artifact of the weekly cycle: in normal years the weekly pattern shifts by 1 day (and in leap years by 2 days). Since we only observe 4.5 years (including one leap year), the averaging did not smooth out the weekly pattern. Over a longer period, closer to a full multiple of the weekly cycle, these peaks would flatten.
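The one/two-day drift of the weekly cycle can be verified directly; a small standalone check (15 March is an arbitrary date chosen for illustration):

```python
import datetime

# weekday of the same calendar date in consecutive years: it shifts by
# 365 % 7 = 1 day in normal years and by 2 days when a 29 February falls
# in between (here: 2015 -> 2016)
weekdays = [datetime.date(year, 3, 15).weekday() for year in range(2013, 2018)]
shifts = [(b - a) % 7 for a, b in zip(weekdays, weekdays[1:])]
print(shifts)  # [1, 1, 2, 1]
```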
Now we will look at how the cluster representatives deviate from the general pattern.
# compute the mean and std for each cluster
yearly_reps = component_df[component_df["component"]=="yearly"].copy()
yearly_reps["month_day"] = yearly_reps.date.dt.strftime("%m-%d")
yearly_reps_agg = (yearly_reps.loc[yearly_reps["Scaled Sales Version"]=="decomposed"]
.groupby(["cluster_rep", "month_day"], observed=True)["Sales Scaled"]
.agg(["mean", "std"])
.reset_index()
.sort_values("month_day")
.fillna(0)) # std of "02-29" is NaN since it occurs only once per series (sample std with n=1 divides by n-1=0)
# create the faceted plot for the clusters
g = sns.relplot(yearly_reps_agg, x="month_day", y="mean", col="cluster_rep", col_wrap=2, aspect=1.5, kind="line", label="Cluster Mean", facet_kws={"sharey": True});
# add the std band
g.map_dataframe(plot_with_band, xvar="month_day", label_band="Cluster ±1σ", linewidth=2)
for i, ax in enumerate(g.axes.flat):
plot_with_band(train_yearly_agg, "month_day", label_line="Global Mean", label_band="Global ±1σ", linestyle="-",
color_line="grey", color_band="lightgrey", ax=ax)
# beautify the subplots
ax.set_title(f"Representative for Cluster {i}")
ax.set_xlim(0, 366)
ax.set_ylim(-2, 10) # we clip that for better scaling
ax.set_xticks(ticks=month_ticks, labels=month_labels)
ax.set_xlabel("Day of Year")
ax.set_ylabel("Scaled Sales")
# adjust legend
handles, labels = ax.get_legend_handles_labels()
g.fig.legend(handles, labels,
loc='upper center',
bbox_to_anchor=(0.5, 1.04), # adjust upward offset
ncol=4, # number of columns for legend items
frameon=False)
# add super title
g.fig.suptitle("Yearly Seasonality Pattern by Clusters", size=16, y=1.08);
plt.subplots_adjust(wspace=0.1)
Firstly, we notice that the bumpy weekly seasonality pattern as well as the end/start-of-month effect also occur in the cluster representatives. The series vary in amplitude, characteristic spikes, and overall variation:
- Cluster 0 (daily food and consumables): Follows the global pattern but lacks the distinct peaks before and after Christmas. The spikes in April and May likely reflect the aftermath of the 16 April Earthquake in Manabí, as the raw values are exceptionally high only on the corresponding dates in 2016.
- Cluster 1 (similar products, different stores): Also tracks the global pattern, but with higher spikes around Christmas, month beginnings, and higher variability overall. This supports the idea that Cluster 1 might represent large-format stores. Additional spikes appear on Labor Day (1 May) and a minor one on Cuenca Independence Day (3 November).
- Cluster 2 (non-perishables and non-food items): Displays nearly no yearly pattern at all—typical for non-food categories. The sharp spike on 21 December 2014 is an extreme outlier for which the data provides no explanation.
- Cluster 3 (home comfort, social goods): Generally follows the global pattern, but shows no Christmas-related increase. Like Cluster 0, it also reflects delayed effects of the earthquake. A slight dip in February results from no recorded sales between 2013 and 2015 for this month, aligning with its typical level shifts.
- Cluster 4 (alcoholic beverages): Falls slightly below the global average but with clear spikes around alcohol-related events—especially Carnival and its preceding Saturday (days in February or at the start of March), Christmas and New Year's Eve. A smaller spike in late July stems from high values in 2015, though no known holiday or event aligns with it.
- Cluster 5 (magazines and mixed products): Roughly resembles the global pattern, but with lower activity around Christmas. The March and October spikes are due to singular outliers which lack obvious explanations, while the May spike, may again relate to the 2016 earthquake, as it occurs only in this year.
In summary, we observe variation in shapes—not so much in the form of clear, wave-like yearly patterns, but rather through recurring calendar effects. This also explains the much lower PACF value at the yearly lag compared to the weekly one, as seen in the PACF distribution plot across all series at the beginning of this clustering section. Cluster 2 shows virtually no recurring pattern at all.
This cluster-based view helped us uncover data-specific effects from exceptional events (like the earthquake) and calendar-based behaviors (like Carnival). Through this inspection, we gained valuable insights and inspiration for relevant feature engineering.
Since the clustering appears to reflect meaningful distinctions, we now proceed to create a pipeline for it.
Clustering Transformer ↑¶
In this subsection we build the clustering pipeline, consisting of our shape-aware but scale-unaware GroupStandardScaler followed by a new custom transformer for the clustering, which turned out to be a bit more complex than usual; the reasons are described below its definition.
class ShapeClusteringTransformer(BaseEstimator, TransformerMixin):
"""
Expects same number of series when fitted and transformed. This makes it cross validation friendly,
since the training window does not need to be defined (theoretically, one could deduce the number of series from that).
"""
def __init__(self, num_series: int, week_period: int = 7, week_trend: int = 101, week_seasonal: int = 101,
year_period: int = 365, year_seasonal: int = 1095, num_clusters: int = 6, use_soft: bool = True,
fit_window_frac: float = 1.0, tau: float = 1.0):
self.week_period = week_period
self.week_trend = week_trend
self.week_seasonal = week_seasonal
self.year_period = year_period
self.year_seasonal = year_seasonal
self.num_series = int(num_series)
self.num_clusters = num_clusters
self.use_soft = use_soft
self.fit_window_frac = fit_window_frac # fraction of training window used for fit(), helpful for cross val
self.tau = tau
def fit(self, X, y=None):
X = self.check_shape(X)
X_reshaped = X.reshape(self.num_series, -1)
# define fitting window, helpful for cross-validation
fit_window_size = int(X_reshaped.shape[1] * self.fit_window_frac)
is_zero, X_nonzero = self.constant_zero_handling(X_reshaped[:, :fit_window_size])
feature_sets = self.STL_extraction(X_nonzero) # (trends, weekly, yearly)
feature_sets_names = ["trends", "weekly_seasonals", "yearly_seasonals"]
# reduce features separately and store a tuple of each component's reduced features and pipeline as values in a dict
# with feature_sets_names keys
reduced = series_dim_reduction(feature_sets=feature_sets, feature_sets_names=feature_sets_names, n_components=20)
#self.pipelines_ = [reduced[name][1] for name in feature_sets_names] # extract the pipeline for each reduction
series_shapes = np.concatenate([reduced[name][0] for name in feature_sets_names], axis=1)
kmeans = KMeans(n_clusters=self.num_clusters, random_state=14).fit(series_shapes)
dists = kmeans.transform(series_shapes)
labels = kmeans.predict(series_shapes)
self.unique_out_ = self.make_unique_outputs(is_zero=is_zero, dists=dists, labels=labels)
return self
def transform(self, X, y=None):
"""Expects the same number and sorting order of series in X as in fit"""
check_is_fitted(self, ["unique_out_"])
X = self.check_shape(X)
window_size = X.reshape(self.num_series, -1).shape[1]
# repeat features over window size so each timestep has the same per-series shape features
return np.repeat(self.unique_out_, window_size, axis=0)
def constant_zero_handling(self, X: np.ndarray):
"""Expects an array X with shape [num_series, window]"""
is_zero = (X == 0).all(axis=1)
X_nonzero = X[~is_zero]
return is_zero, X_nonzero
def STL_extraction(self, X: np.ndarray):
"""
Extracts the trend, weekly and yearly seasonality of each row in X. Expects X of shape [num_series, training window]
"""
results = Parallel(n_jobs=-1, backend="loky")(delayed(extract_components)(series)
for series in X)
# return lists as arrays of shape num_series, length of considered window
return tuple(map(np.array, zip(*results))) # materialize: a lazy map object could only be iterated once
def soft_memberships_from_distances(self, dists, tau=1.0, eps=1e-8):
"""
Applies the softmax function on the negative distances to compute the probabilities that a series
belongs to each cluster (higher absolute distance -> lower probability). We could tune tau to make it more one-hot-like
(tau<1) or more uniform (tau>1).
"""
# dists: shape [n_samples, n_clusters], non-negative
z = -dists / max(tau, eps)
z = z - z.max(axis=1, keepdims=True) # stabilize to prevent underflow to exactly 0
expz = np.exp(z)
return expz / (expz.sum(axis=1, keepdims=True) + eps)
def make_unique_outputs(self, is_zero: np.ndarray, dists: np.ndarray, labels: np.ndarray):
"""Wrapper for creating the correct output"""
if self.use_soft:
dists = self.check_numpy(dists)
memberships = self.soft_memberships_from_distances(dists, tau=self.tau) # normalize the distances
out = np.zeros((self.num_series, self.num_clusters + 1), dtype=np.float32)
out[~is_zero, :self.num_clusters] = memberships # zero-only series stay zero (no membership to any cluster)
out[is_zero, self.num_clusters] = 1.0 # flag # explicit flag for zero-only series in last column
else:
labels = self.check_numpy(labels)
out = np.zeros((self.num_series, 2), dtype=np.float32) # one col for the labels and one for the zero-only flag
out[~is_zero, 0] = labels.astype(np.float32)
out[is_zero, 0] = -1 # "cluster label"
out[is_zero, 1] = 1.0 # 1 flag
return out
def get_feature_names_out(self, input_features=None):
if self.use_soft:
return np.asarray([f"membership_cluster{i}" for i in range(self.num_clusters)] + ["is_constant_zero"])
else:
return np.asarray(["shape_cluster", "is_constant_zero"])
def check_numpy(self, X: any):
"""
Checks if X is a Series or DataFrame and if so converts it to an array. If it is an array, return as is.
If X is of another type tries to convert X to an array and throws an error if this attempt fails.
"""
if hasattr(X, "to_numpy"):
return X.to_numpy()
if isinstance(X, np.ndarray):
return X
# last-resort or clear error:
try:
return np.asarray(X)
except Exception as e:
raise TypeError(f"Unsupported X type: {type(X)}") from e
def check_shape(self, X: any):
"""
Checks if the shape is in the correct form and standardizes it if not. Also checks, if all series
have the same length and raises an error if not.
"""
X = self.check_numpy(X)
assert (X.ndim == 2 and X.shape[1] == 1) or (X.ndim == 1), "X must be 1D or of shape (n, 1)"
assert hasattr(self, "num_series"), "self.num_series not set."
if X.ndim == 1:
X = X.reshape(-1, 1) # standardized X
# check equal series length
if X.size % self.num_series != 0:
raise ValueError(f"Length {X.size} not divisible by num_series={self.num_series}. "
"Check grouping/window alignment.")
return X
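The soft-membership step can be exercised in isolation. Below is a standalone sketch mirroring the logic of `soft_memberships_from_distances`; the distance values are made up for illustration:

```python
import numpy as np

def soft_memberships(dists, tau=1.0, eps=1e-8):
    # softmax over negative distances: the closer a centroid, the higher the membership
    z = -np.asarray(dists, dtype=float) / max(tau, eps)
    z = z - z.max(axis=1, keepdims=True)   # stabilize to prevent underflow
    expz = np.exp(z)
    return expz / (expz.sum(axis=1, keepdims=True) + eps)

dists = np.array([[0.5, 2.0, 4.0]])        # one series, three cluster centroids
m_soft = soft_memberships(dists, tau=1.0)  # probability-like memberships
m_hard = soft_memberships(dists, tau=0.1)  # smaller tau -> closer to one-hot
```

Lowering `tau` sharpens the distribution toward the nearest centroid, which is exactly the knob the transformer exposes via its `tau` parameter.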
Design note: why this transformer is “more involved”
Clustering pipeline (custom steps):
- GroupwiseStandardScaler – scales sales per group (stores/families) to remove level effects.
- ShapeClusteringTransformer – STL → per-component PCA → k-means → soft memberships, plus a clean handling of constant-zero series and cross-validation-safe windowing.
Why a single transformer for STL → PCA → k-means?
- Shared special case: Constant-zero series affect multiple stages (STL, PCA, k-means). Keeping them in one class lets us compute one mask once, apply it consistently, and expose a simple output (memberships + is_constant_zero).
- Per-component reduction: I reduce trend, weekly, and yearly components separately so each contributes equally to clustering. Doing this across multiple transformers would require custom glue to pass lists of arrays and keep shapes aligned.
- Leakage control and efficiency: The transformer is fitted on the training window (or a training fraction for CV), stores the resulting per-series memberships, and only repeats them over the requested window length in transform. This ensures validation/test features never use information beyond the fitted context.
- Model-friendly features: For modeling, I return soft cluster memberships (probability-like) and a binary is_constant_zero flag, repeated per timestep as needed.
This is intentionally a more advanced transformer than typical sklearn utilities. It trades some modularity for clarity, numeric stability, and efficiency.
We now create features for the effects we uncovered during our clustering analysis, starting with the start/end-of-month effect, which we will examine more closely in the next section.
train_day_agg = (train[~train.date.dt.strftime("%m-%d").isin(["01-01", "12-25"])] # these days bias the pattern heavily
.groupby("day")["sales_scaled"]
.agg(["mean", "std"])
.reset_index())
fig, ax = plt.subplots(figsize=(10, 4))
plot_with_band(train_day_agg, "day", label_line="Mean", label_band="±1σ")
plt.xlim(1, 31); plt.ylim(-1.5, 1.5);
xticks_labels = [1] + list(range(5, 31, 5))
plt.xticks(ticks=xticks_labels, labels=xticks_labels)
plt.xlabel("Calendar Day of Month"); plt.ylabel("Mean of Scaled Sales")
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles, labels,
loc='upper center',
bbox_to_anchor=(0.5, 1.12),
ncol=2,
frameon=False);
The calendar day of month shows a clear effect: values peak at the start, dip toward the middle, and rise again before the end—consistent with salary payment cycles. While the strength of this pattern varies across clusters (e.g., weak in Cluster 2, sharp at month-end in Cluster 4), a global day-of-month feature combined with the cluster assignment allows the model to learn these differences.
Distance-to-start and distance-to-end of month turned out to be almost perfectly linear mirrors of each other (correlation ≈ 0.97 after aligning their mean values). Since distance-to-start is just day-of-month shifted by one, we keep only the calendar day of month as a compact feature.
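This collinearity can be reproduced standalone. A sketch correlating day of month with distance to month end over one non-leap year (the sign flips relative to the text because one counts up while the other counts down; the -0.9 threshold below is illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2015-01-01", "2015-12-31")
day_of_month = dates.day.to_numpy()
dist_to_end = dates.days_in_month.to_numpy() - day_of_month

# within each month the two are exact mirrors; varying month lengths only
# slightly weaken the overall (anti-)correlation
corr = np.corrcoef(day_of_month, dist_to_end)[0, 1]
print(round(corr, 3))  # strongly negative, close to -1
```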
We already noticed that there is no strong wave-like yearly recurrence when inspecting the clusters. Perhaps this becomes more visible when we aggregate to month instead of by day of year?
train_month_agg = (train.groupby(["month"], observed=False)["sales_scaled"] # observed must be a bool, not a string
.agg(["mean", "std"])
.reset_index())
fig, ax = plt.subplots(figsize=(10, 4))
plot_with_band(train_month_agg, "month", label_line="Mean", label_band="±1σ")
plt.xticks(ticks=np.arange(0, 12, 1), labels=list(calendar.month_abbr)[1:])
plt.xlabel("Calendar Month"); plt.ylabel("Scaled Sales")
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles, labels,
loc='upper center',
bbox_to_anchor=(0.5, 1.12),
ncol=2,
frameon=False);
Even here, the pattern appears rather flat, apart from the December spike caused by Christmas. February and August show slightly lower values, and there is a small bump in July which does not clearly align with calendar effects.
To keep things pragmatic, we will include the month as a feature and later evaluate whether it meaningfully contributes to the model's performance.
We now continue with a closer analysis of the effect of holidays.
locale_types = pd.concat([
(train.loc[~train.date.dt.strftime("%m-%d").isin(["12-25", "01-01"])] # on these days a lot of shops are closed
.groupby(f"{level}_type", observed=False)["sales_scaled"]
.agg(["mean", "std"])
.assign(locale_level=level))
for level in locale_levels
],
axis=0).reset_index(names="type")
g = sns.catplot(locale_types, x="type", y="mean", col="locale_level", kind="bar", edgecolor="darkgrey");
for ax, level in zip(g.axes.flat, g.col_names):
sub = g.data[g.data.locale_level==level]
ax.errorbar("type", "mean", yerr="std", fmt="none", capsize=0, lw=1, color=sns.color_palette("deep")[0],
data=sub)
ax.tick_params("x", labelrotation=35)
ax.set_xlabel("Day Type")
ax.set_ylabel("Scaled Sales")
ax.set_title(level.capitalize())
ax.set_ylim(-1.5, 2.5)
plt.suptitle("Mean Sales ± Standard Deviation", fontsize=14, y=1.08);
plt.subplots_adjust(wspace=0.1)
Holidays of all types are generally associated with above-average sales (although not all types occur on each locale level), most notably additional holidays at the national level. The only exception is national Transferred Holidays, which consistently show below-average sales. This cannot be explained by weekday patterns or by proximity to the actual celebrated holiday. The mechanism is unclear, but its relevance for modeling is likely limited.
Still, these features may be useful, as they capture the strength of different holiday types across locale levels—especially when combined with features such as the distance to any holiday. We examine this next in a raw and a smoothed version, since the raw version still strongly reflects the weekly pattern, which makes interpretation difficult.
# function to create raw and smoothed versions
def create_raw_smooth_df(df:pd.DataFrame, grouper: str, aggs: list[str]=["mean", "std"]):
base_df = df.groupby(grouper, observed=True)["sales_scaled"].agg(aggs)
raw_smooth_df = pd.concat(
[base_df.assign(values="Raw"),
base_df.rolling(7, center=True).mean().assign(values="Smoothed")], # 7 day window filters out weekly pattern
axis=0).reset_index()
return raw_smooth_df
# use function. Stores are closed on Christmas and New Year
holiday_dist_df = create_raw_smooth_df(
train[~train.date.dt.strftime("%m-%d").isin(["12-25", "01-01"])],
"dist_any_holiday"
)
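Why a centered 7-day window: averaging any pure period-7 pattern over exactly one period collapses it to a constant, so the smoothed curves are free of the weekly cycle. A minimal standalone illustration:

```python
import numpy as np
import pandas as pd

# a strict weekly cycle (period 7), repeated over ten weeks
weekly_cycle = pd.Series(np.tile([0.0, 1, 2, 3, 4, 5, 6], 10))
smoothed = weekly_cycle.rolling(7, center=True).mean().dropna()

# every 7-day window contains each weekday value exactly once -> constant mean
print(smoothed.unique())  # [3.]
```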
# create a wrapper for distance plots
def beautify_dist_plot(zoom: int, interval: int, title_event: str, grid: sns.FacetGrid, y_min: int=None, y_max: int=None): # grid passed explicitly instead of defaulting to the global g
xticks_labels=np.arange(-zoom, zoom + interval, interval)
for ax in grid.axes.flat:
ax.set_xlabel("Distance in Number of Days")
ax.set_ylabel("Sales Scaled")
ax.set_xlim(-zoom, zoom);
ax.set_xticks(xticks_labels, xticks_labels);
ax.set_ylim(y_min, y_max);
grid.set_titles(template="{col_name} Values" if grid.row_names==[] else "{col_name} Values for {row_name}")
handles, labels = ax.get_legend_handles_labels()
grid.fig.legend(handles, labels,
loc='upper center',
bbox_to_anchor=(0.5, 1.08),
ncol=2,
frameon=False);
plt.suptitle(f"Raw and Smoothed Values of Sales Means by Distance to {title_event}", y=1.12);
# function which combines the previous defined functions
def plot_dist_with_band(df: pd.DataFrame, dist_col: str, title_event: str, zoom: int=42, interval: int=7, height: float=4,
aspect: float=1.5, y_min=None, y_max=None):
df_raw_smooth = create_raw_smooth_df(df, dist_col)
g = sns.FacetGrid(df_raw_smooth, col="values", height=height, aspect=aspect)
g.map_dataframe(plot_with_band, xvar=dist_col, label_line="Mean", label_band="±1σ")
beautify_dist_plot(zoom, interval, title_event, grid=g, y_min=y_min, y_max=y_max)
plot_dist_with_band(train[~train.date.dt.strftime("%m-%d").isin(["12-25", "01-01"])], # stores are closed on these days
"dist_any_holiday", "Any Holiday", 35)
There is still a clear positive effect on the holiday itself (x=0), and—as expected—a short recovery effect of around 2-4 days. The anticipation effect lasts nearly twice as long but peaks two days before the holiday, perhaps because many people avoid shopping on the day immediately before, expecting crowded stores. For these holidays (excluding Christmas), peaks at longer distances reflect the weekly cycle (lag=7), as they do not appear in the smoothed version.
But how does this look for Christmas?
plot_dist_with_band(train, "dist_christmas", "Christmas", 42)
Christmas (x=0) shows a distinct two-wave anticipation effect: it first extends the end/start-of-month effect at the beginning of December and then drives a strong rise in sales in the two weeks before Christmas. The recovery effect is less clear, as it overlaps with New Year and the January end/start-of-month effect, but it may last until the first week of January (x = 12).
Since distance to New Year is essentially a shifted version of this pattern, it is sufficient to add the flags is_new_year (and possibly is_christmas) to capture the sharp sales drops when most stores are closed. For our first tree-based model, no clipping is required, as it will naturally split at the relevant distances.
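A minimal sketch of how such flags can be derived from the date column (the column name `date` matches the notebook; the frame here is synthetic):

```python
import pandas as pd

# synthetic frame spanning the holiday period
df = pd.DataFrame({"date": pd.date_range("2016-12-20", "2017-01-05")})
month_day = df["date"].dt.strftime("%m-%d")

# binary flags for the closure days
df["is_christmas"] = month_day == "12-25"
df["is_new_year"] = month_day == "01-01"

print(df["is_christmas"].sum(), df["is_new_year"].sum())  # 1 1
```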
plot_dist_with_band(train, "dist_viernes_santo", "Good Friday", y_min=-1.5, y_max=2)
Surprisingly, from a European perspective, there seems to be no positive effect around Good Friday or Easter—at most only a small negative one. In Ecuador, Semana Santa and Easter itself are deeply religious events centered on abstinence rather than consumption; Easter Sunday and Easter Monday are not even official holidays.
Since I am unsure whether this small negative effect will be beneficial for modeling—perhaps only by helping to estimate the impact of other holidays more precisely—I will nevertheless include this distance as a feature in the first model.
Next we look at the days that were redeclared as work days to compensate for additional or bridge days. These are all Saturdays, so one might think we could simply compare them with regular Saturdays. However, this would be misleading, since there are only five such days in the training set and three fall in January, which typically shows lower values. Therefore, we also look at the distance plots here, as there might be anticipation or recovery effects (for example, sales on the following Sunday could be higher than usual).
plot_dist_with_band(train, "dist_workday", "Saturdays redeclared to Work Days", zoom=56, y_min=-2, y_max=3)
It appears that there is a negative effect on the day itself, but the leading above-average and trailing below-average values seem to reflect the pre- and post-Christmas periods. We fit a small linear regression including the flag for redeclared Saturdays as well as weekday, month, and days elapsed as control variables.
import statsmodels.formula.api as smf
model = smf.ols(
"sales_scaled ~ C(weekday) + C(month) + days_elapsed + C(is_workday)",
data=train
).fit(cov_type="HC3") # HC3 = robust SEs
print(model.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: sales_scaled R-squared: 0.150
Model: OLS Adj. R-squared: 0.150
No. Observations: 2983068 F-statistic: 2.451e+04
Covariance Type: HC3 Prob (F-statistic): 0.00
===========================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept -0.7862 0.002 -337.900 0.000 -0.791 -0.782
C(weekday)[T.Tuesday] -0.0725 0.002 -39.716 0.000 -0.076 -0.069
C(weekday)[T.Wednesday] -0.0771 0.002 -42.784 0.000 -0.081 -0.074
C(weekday)[T.Thursday] -0.1613 0.002 -89.451 0.000 -0.165 -0.158
C(weekday)[T.Friday] -0.0222 0.002 -12.011 0.000 -0.026 -0.019
C(weekday)[T.Saturday] 0.3650 0.002 179.018 0.000 0.361 0.369
C(weekday)[T.Sunday] 0.3635 0.002 170.108 0.000 0.359 0.368
C(month)[T.February] -0.0603 0.002 -25.082 0.000 -0.065 -0.056
C(month)[T.March] -0.0011 0.002 -0.456 0.648 -0.006 0.004
C(month)[T.April] -0.0632 0.002 -25.916 0.000 -0.068 -0.058
C(month)[T.May] -0.0515 0.002 -20.991 0.000 -0.056 -0.047
C(month)[T.June] -0.0692 0.002 -29.403 0.000 -0.074 -0.065
C(month)[T.July] -0.0231 0.002 -9.795 0.000 -0.028 -0.018
C(month)[T.August] -0.0914 0.002 -38.204 0.000 -0.096 -0.087
C(month)[T.September] 0.0121 0.003 4.735 0.000 0.007 0.017
C(month)[T.October] -0.0015 0.003 -0.583 0.560 -0.007 0.004
C(month)[T.November] -0.0170 0.003 -6.763 0.000 -0.022 -0.012
C(month)[T.December] 0.2161 0.003 67.768 0.000 0.210 0.222
C(is_workday)[T.True] -0.1022 0.011 -9.635 0.000 -0.123 -0.081
days_elapsed 0.0007 1.12e-06 582.297 0.000 0.001 0.001
===========================================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 2.32e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The results support the hypothesis that these Saturdays have a moderate negative effect, though, as is often the case with multiple time series, multicollinearity may bias the coefficients, so they should be interpreted with caution. Nevertheless, we will keep the distance to the redeclared Saturdays as a feature and let the models decide.
In the following section we turn to the effects of events.
We start with examining the effects of recurring events such as Black Friday, Cyber Monday and Mother's Day.
dist_recur_events = ["dist_black_friday", "dist_cyber_monday", "dist_dia_de_la_madre"]
dist_recur_events_df = pd.concat(
[create_raw_smooth_df(train, dist_event).rename(columns={dist_event: "dist"})
.assign(event=dist_event.removeprefix("dist_")
.replace("_", " ")
.title())
for dist_event in dist_recur_events],
axis=0)
g = sns.FacetGrid(dist_recur_events_df,
row="event", col="values",
height=3, aspect=2);
g.map_dataframe(plot_with_band, xvar="dist", label_line= "Mean", label_band="±1σ");
beautify_dist_plot(35, 7, "Recurring Events", grid=g)
Neither Black Friday, Cyber Monday nor Mother's Day (Día de la Madre) shows any effect on sales. The small bumps before and after Black Friday and Cyber Monday reflect the end/start-of-month effects of November and December. The higher level for these two events is an artifact of the trend, as these events only began in 2014. Therefore, there is no need to add these features to our model. What effect do one-off events such as the earthquake have?
plot_dist_with_band(train, "dist_terremoto_manabi", "Earthquake", zoom=63, y_min=-2, y_max=3)
The effect of the earthquake that struck the Manabí region on 16 April 2016 is hard to isolate, as it is preceded by the April end/start-of-month effect (around x=-15) and followed by the May effect (x=15), with which it appears to overlap. The impact may even have lasted until June. Still, the wider standard deviation band in the 28 days after the event reflects that some regions were highly impacted while others were barely affected. The city with the strongest sales peak after the earthquake is Daule, located close to the border of the Manabí region.
Nevertheless, once we add this event as a feature, our models will be able to capture the duration of the effect and learn how high sales are under normal circumstances. Next, we inspect the distance to matches of the 2014 Soccer World Cup.
plot_dist_with_band(train, "dist_mundial", "Any Soccer World Cup Matches", zoom=35, y_min=-1.5, y_max=1.5)
Visually, it is difficult to determine whether matches had an effect (12 June-13 July 2014). The raw pattern seems to shift around x=0, and the level differs within 14 days before and after a match, but this is unlikely to be caused by the matches themselves and could instead reflect end/start-of-month effects. The generally lower level is most likely an artifact of sales growth, as 2014 is only the second year in the dataset.
We also have a column encoding the World Cup stage. Simply grouping and checking sign and magnitude would be misleading, since each stage contains only a few matches and weekday effects could confound the results. To address this, we build a small regression model that controls for weekday and the trend, and compares the stage values (0-6) with days without a match (-1):
import statsmodels.formula.api as smf
model = smf.ols(
"sales_scaled ~ C(weekday) + days_elapsed + C(mundial_stage)",
data=train
).fit(cov_type="HC3") # HC3 = robust SEs
print(model.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: sales_scaled R-squared: 0.145
Model: OLS Adj. R-squared: 0.145
No. Observations: 2983068 F-statistic: 3.212e+04
Covariance Type: HC3 Prob (F-statistic): 0.00
===========================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept -0.8053 0.002 -466.787 0.000 -0.809 -0.802
C(weekday)[T.Tuesday] -0.0718 0.002 -39.067 0.000 -0.075 -0.068
C(weekday)[T.Wednesday] -0.0763 0.002 -42.013 0.000 -0.080 -0.073
C(weekday)[T.Thursday] -0.1605 0.002 -88.396 0.000 -0.164 -0.157
C(weekday)[T.Friday] -0.0229 0.002 -12.292 0.000 -0.027 -0.019
C(weekday)[T.Saturday] 0.3603 0.002 176.964 0.000 0.356 0.364
C(weekday)[T.Sunday] 0.3639 0.002 169.581 0.000 0.360 0.368
C(mundial_stage)[T.0] -0.1355 0.018 -7.594 0.000 -0.170 -0.101
C(mundial_stage)[T.1] -0.2851 0.010 -28.231 0.000 -0.305 -0.265
C(mundial_stage)[T.2] -0.0801 0.010 -8.189 0.000 -0.099 -0.061
C(mundial_stage)[T.3] 0.2067 0.014 14.780 0.000 0.179 0.234
C(mundial_stage)[T.4] 0.1272 0.011 11.062 0.000 0.105 0.150
C(mundial_stage)[T.5] 0.1206 0.022 5.474 0.000 0.077 0.164
C(mundial_stage)[T.6] -0.0071 0.020 -0.348 0.728 -0.047 0.033
days_elapsed 0.0007 1.12e-06 585.742 0.000 0.001 0.001
===========================================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 5.08e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Early stages show negative effects, perhaps because people stayed at home to watch the matches (especially when Ecuador played, which occurred only at stage 1). In contrast, matches from the quarterfinals to the semifinals (stages 3-5) show substantial positive effects, while the final's effect is indistinguishable from zero. Nevertheless, the results should be interpreted with caution, as the coefficients may be biased by multicollinearity.
Overall, the effect does not appear to be too strong. Still, the stage feature, together with the distance-to-any-match feature, could prove useful when combined with other features (such as product families and shape clusters) and we will let the models decide.
Others ↑¶
We also have some store-level attributes in the columns, such as city, state, type, and the provided cluster (not our shape clusters). Since the target variable is groupwise-scaled per series, their direct influence on the scaled values cancels out. We already analyzed the relationship between these columns and raw sales in the target preprocessing section, with little insight gained.
While we would expect the shape clusters to explain the general patterns, these supra-series features might help capture intra-series characteristics such as outliers. We will therefore add them to our model and later evaluate their importance scores.
There is, however, one more feature we have not looked at so far: promotions. For each day we know how many articles of a product family in a given store were promoted (and will be promoted in the future). The joint distribution of scaled sales and the onpromotion column looks like this:
plt.figure(figsize=(16, 5))
sns.scatterplot(train, x="onpromotion", y="sales_scaled", alpha=0.5);
plt.xlabel("Promoted Articles")
plt.ylabel("Scaled Sales");
There is a hyperbolic-like structure with high heteroscedasticity: lower numbers of promoted articles show far higher variance in sales than higher numbers. Since the relationship is non-linear, the Pearson product-moment correlation between these variables is only moderate (0.16). Nevertheless, we expect our non-linear models to capture this relationship well, so we include it as a feature.
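As a quick sanity check of this intuition, the following sketch (purely synthetic data, not the Favorita columns) shows how a monotone but non-linear, heteroscedastic relationship yields a Pearson product-moment correlation noticeably below the corresponding rank (Spearman) correlation:

```python
import numpy as np
import pandas as pd

# Synthetic sketch (hypothetical data, not the Favorita columns):
# a monotone but non-linear, heteroscedastic relationship yields a
# Pearson (product-moment) correlation below the rank correlation.
rng = np.random.default_rng(42)
promo = rng.integers(0, 200, size=10_000).astype(float)  # stand-in for "promoted articles"
noise = rng.normal(scale=1.0 / (1.0 + promo))            # variance shrinks as promo grows
sales = np.log1p(promo) + noise                          # concave, non-linear link

df = pd.DataFrame({"onpromotion": promo, "sales_scaled": sales})
pearson = df["onpromotion"].corr(df["sales_scaled"])                      # linear association
spearman = df["onpromotion"].corr(df["sales_scaled"], method="spearman")  # monotone association
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Tree-based models such as LightGBM split on raw thresholds and are therefore insensitive to this kind of monotone non-linearity, which is another reason to keep the raw column.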
We now draw a sample based on the different shape clusters. To do this, we apply the clustering pipeline and store its outcome in fixed columns in train. However, we apply it only to a large initial part of the training window to ensure that all samples and subsets drawn during training—e.g., in cross-validation—have consistent cluster memberships without risking future leakage, since validation sets will always be drawn beyond this first segment of the training window. This approach allows us to perform the resource-intensive fit and transform only once, instead of repeating it for every validation split.
In the final pipeline, we will fit clustering_pipeline on the full training set, so that it can also transform test, which includes the training window. But for now we simply do:
train_sub_window_ratio = 0.85
n_clusters = 6
clustering_pipeline_cv = make_pipeline(
GroupStandardScaler(num_groups=num_series, inverse_sorted_by_group=True),
ShapeClusteringTransformer(num_series=num_series, num_clusters=n_clusters, fit_window_frac=train_sub_window_ratio,
use_soft=True)
)
suffix = int(train_sub_window_ratio * 100)
cluster_col_names = [f"membership_cluster{i}_{suffix}" for i in range(n_clusters)] + [f"is_constant_zero_{suffix}"]
cluster_out = clustering_pipeline_cv.fit_transform(train["sales"])
train[cluster_col_names] = cluster_out
train[cluster_col_names]
| membership_cluster0_85 | membership_cluster1_85 | membership_cluster2_85 | membership_cluster3_85 | membership_cluster4_85 | membership_cluster5_85 | is_constant_zero_85 | |
|---|---|---|---|---|---|---|---|
| 0 | 1.398860e-14 | 0.999998 | 9.187750e-09 | 4.852531e-11 | 8.424205e-28 | 2.451601e-06 | 0.0 |
| 1 | 1.398860e-14 | 0.999998 | 9.187750e-09 | 4.852531e-11 | 8.424205e-28 | 2.451601e-06 | 0.0 |
| 2 | 1.398860e-14 | 0.999998 | 9.187750e-09 | 4.852531e-11 | 8.424205e-28 | 2.451601e-06 | 0.0 |
| 3 | 1.398860e-14 | 0.999998 | 9.187750e-09 | 4.852531e-11 | 8.424205e-28 | 2.451601e-06 | 0.0 |
| 4 | 1.398860e-14 | 0.999998 | 9.187750e-09 | 4.852531e-11 | 8.424205e-28 | 2.451601e-06 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3007997 | 6.475207e-01 | 0.000026 | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 0.0 |
| 3007998 | 6.475207e-01 | 0.000026 | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 0.0 |
| 3007999 | 6.475207e-01 | 0.000026 | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 0.0 |
| 3008000 | 6.475207e-01 | 0.000026 | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 0.0 |
| 3008001 | 6.475207e-01 | 0.000026 | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 0.0 |
2983068 rows × 7 columns
For faster experimentation (especially when training neural networks), we randomly draw the same number of series from each cluster we have just found plus a proportional number of the zero-only series. Compared to a fully representative sampling, where the smallest shape cluster would be represented by only 18 time series and thus lack diversity, this mixed approach ensures sufficient variation within each cluster while keeping the sample size small, and at the same time forces the model to learn to detect zero-only series. Training on this sample will already give a preliminary indication of which model class performs best on the full dataset.
We could also exclude the constant zero series and forecast them manually, but including them keeps the pipeline simpler, as the model can easily learn this pattern via the is_constant_zero flag.
# define number of series per cluster
n_nonzero = 50 # for each cluster to avoid bias caused by large clusters
# define number of zero-only series proportional to proportion of non-zero samples to all non-zero series
num_zero_series = len(zero_ids)
n_zero = np.round(
(n_clusters * n_nonzero) / (num_series - num_zero_series) * num_zero_series,
0
).astype(int).item()
def get_random_sample_series(X: pd.DataFrame, cluster_cols: list[str], id_col: str="series_id", n_nonzero: int=n_nonzero,
n_zero: int=n_zero, seed=14) -> np.ndarray:
"""
Randomly selects n_nonzero series per cluster (based on membership scores)
and concatenates them with n_zero random zero-only series.
Returns the combined series_ids.
"""
# reconstruct ids and labels from the data itself
is_zero_col = cluster_cols[-1]
membership_cols = cluster_cols[:-1]
X_unique = X.drop_duplicates(subset=id_col, keep="first") # identical scores for one series
nonzero_series, zero_series = [X_unique.loc[(X_unique[is_zero_col] == i)] for i in (0, 1)]
# get cluster labels only for nonzero series
labels = np.argmax(nonzero_series[membership_cols].to_numpy(), axis=1)
nonzero_series_ids, zero_series_ids = [df[id_col].to_numpy() for df in (nonzero_series, zero_series)]
# draw sample from nonzero series
rng = np.random.default_rng(seed=seed)
keep_mask = np.zeros(labels.size, dtype=bool)
n_cluster = np.unique(labels).size
for c in range(n_cluster):
idx = np.where(labels==c)[0]
assert n_nonzero <= idx.size, f"Requested more nonzero series from cluster {c} than available"
sample_idx = rng.choice(idx, n_nonzero, replace=False)
keep_mask[sample_idx] = True
selected_nonzero = nonzero_series_ids[keep_mask]
# draw sample from zero-only series
assert n_zero <= zero_series_ids.size, "Requested more zero-only series than available"
selected_zero = rng.choice(zero_series_ids, n_zero, replace=False)
return np.concatenate([selected_nonzero, selected_zero], axis=0)
sample_series = get_random_sample_series(train, cluster_cols = cluster_col_names)
print(f"Number of sample series: {sample_series.size}\n"
f"Number of zero-only series in the sample: {n_zero}")
Number of sample series: 309
Number of zero-only series in the sample: 9
The result indeed equals the number of clusters multiplied by the number of series drawn from each cluster, plus the number of series that are constantly zero. We now only need to define which columns to include in the pipeline for our first model and then filter the rows and columns of train to create the sample.
cat_basic_cols = ["store_nbr", "family", "city", "type", "cluster"]
num_basic_cols = ["onpromotion"]
periodic_calendar_cols = ["weekday", "day", "month", "day_of_year"]
num_calendar_cols = ["days_elapsed", "year"]
cat_holiday_cols = ["is_new_year", "local_type", "regional_type", "national_type", "is_leap_year"]
num_holiday_cols = ["mundial_stage"]
dist_holiday_cols = ["dist_workday", "dist_local_holiday", "dist_regional_holiday", "dist_national_holiday",
"dist_any_holiday", "dist_christmas", "dist_viernes_santo", "dist_terremoto_manabi",
"dist_mundial_ecuador"]
feature_cols = (cat_basic_cols
+ num_basic_cols
+ periodic_calendar_cols
+ num_calendar_cols
+ cat_holiday_cols
+ num_holiday_cols
+ dist_holiday_cols
+ cluster_col_names # already defined at the end of clustering section
+ ["sales"]) # needed later to calculate lagged features and will be dropped afterwards
X = train.loc[train.series_id.isin(sample_series), feature_cols]
X.shape
(517266, 35)
y = train.loc[train.series_id.isin(sample_series), ["sales"]]
y.shape
(517266, 1)
We can now proceed with creating a pipeline for our baseline model: A simple LightGBM model. LightGBM—being an efficient tree-based model—typically outperforms linear models when the data is complex and forecasts result from numerous interacting features, as in this case: multiple seasonalities, holiday types, distances to events, promotions, and others. This makes it a fast baseline that is hard to beat.
I implemented a transformer for cyclical time features (weekday, month, day of year) using Fourier encodings with different numbers of harmonics. Encoding day_of_year with Fourier terms improved performance on the folds around the year boundary, because it models yearly seasonality continuously rather than as 366 independent categories, which would make splits harder for the model.
Even for only seven categories, Fourier encoding the weekday captured the weekly pattern better, consistently improving results.
For LightGBM, however, categorical encoding for day of month and calendar month (day, month) worked best.
from pandas.api.types import is_integer_dtype
KNOWN_PERIODS_HARMONICS_AND_CATEGORIES = {
"weekday": (7, 2, weekdays),
"day": (30.44, 2, np.arange(1, 32)), # not used
"month": (12, 2, months), # not used
"day_of_year": (365.2425, 3, np.arange(1, 367))
}
def create_fouriers(X: pd.DataFrame, periods_harmonics: dict) -> np.ndarray:
"""
Compute sine–cosine fourier encodings for temporal features.
Accepts DataFrames with ordered categorical or integer Series (e.g., weekday, month).
Returns a NumPy array with sin and cos values for each column of X using the given period and number of harmonics.
"""
assert isinstance(X, pd.DataFrame), "Feature needs to be a DataFrame for correct handling"
encodings = []
for col in X:
feature = X[col]
period, harmonics, categories = periods_harmonics.get(col)
if period is None or harmonics is None or categories is None:
raise ValueError(
f"Provide period, number of harmonics and categories for feature {col} in periods_harmonics"
)
feature = pd.Series(pd.Categorical(feature, categories=categories, ordered=True))
for k in range(1, harmonics + 1):
codes = feature.cat.codes.to_numpy() # integer representation
theta = 2 * np.pi * codes * k / period
encodings.append(np.column_stack([np.sin(theta), np.cos(theta)]))
return np.concatenate(encodings, axis=1).astype("float32")
# custom class for proper feature names and potential tuning
class FourierTransformer(BaseEstimator, TransformerMixin):
def __init__(self, periods_harmonics: dict = None):
self.periods_harmonics = periods_harmonics
def fit(self, X, y=None):
return self # Nothing to learn here
def transform(self, X):
return create_fouriers(X, periods_harmonics = self.periods_harmonics)
def get_feature_names_out(self, input_features=None):
return [
name for inp in input_features
for k in range(1, self.periods_harmonics.get(inp)[1] + 1)
for name in [f"{inp}_sin_{k}", f"{inp}_cos_{k}"]
]
fourier_transformer = FourierTransformer(periods_harmonics=KNOWN_PERIODS_HARMONICS_AND_CATEGORIES)
pd.DataFrame(fourier_transformer.fit_transform(X[["day", "day_of_year"]]),
columns= fourier_transformer.get_feature_names_out(["day", "day_of_year"]))
| day_sin_1 | day_cos_1 | day_sin_2 | day_cos_2 | day_of_year_sin_1 | day_of_year_cos_1 | day_of_year_sin_2 | day_of_year_cos_2 | day_of_year_sin_3 | day_of_year_cos_3 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 |
| 1 | 0.204950 | 0.978773 | 0.401198 | 0.915991 | 0.017202 | 0.999852 | 0.034399 | 0.999408 | 0.051585 | 0.998669 |
| 2 | 0.401198 | 0.915991 | 0.734988 | 0.678080 | 0.034399 | 0.999408 | 0.068757 | 0.997633 | 0.103033 | 0.994678 |
| 3 | 0.580414 | 0.814322 | 0.945287 | 0.326240 | 0.051585 | 0.998669 | 0.103033 | 0.994678 | 0.154207 | 0.988039 |
| 4 | 0.734988 | 0.678080 | 0.996762 | -0.080414 | 0.068757 | 0.997633 | 0.137188 | 0.990545 | 0.204970 | 0.978768 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 517261 | -0.651878 | 0.758324 | -0.988669 | 0.150111 | -0.422847 | -0.906201 | 0.766369 | 0.642400 | -0.966122 | -0.258087 |
| 517262 | -0.482622 | 0.875829 | -0.845388 | 0.534153 | -0.438373 | -0.898793 | 0.788013 | 0.615658 | -0.978149 | -0.207905 |
| 517263 | -0.292876 | 0.956150 | -0.560067 | 0.828447 | -0.453769 | -0.891119 | 0.808725 | 0.588187 | -0.987572 | -0.157170 |
| 517264 | -0.090697 | 0.995879 | -0.180645 | 0.983548 | -0.469031 | -0.883182 | 0.828479 | 0.560020 | -0.994364 | -0.106017 |
| 517265 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | -0.484154 | -0.874983 | 0.847253 | 0.531190 | -0.998509 | -0.054581 |
517266 rows × 10 columns
I also implemented a transformer that clips the temporal distance to events, since long distances might otherwise mislead the model into using the unclipped values as a proxy for the number of days elapsed, especially for events that occur only once.
DISTANCE_CLIPPINGS = {
"dist_workday": [-7, 7], # longer effects for Saturdays redeclared to workdays are not plausible
"dist_local_holiday": [-7, 7],
"dist_regional_holiday": [-7, 7],
"dist_national_holiday": [-7, 7],
"dist_any_holiday": [-7, 7],
"dist_christmas": [-28, 14],
"dist_viernes_santo": [-56, 14], # -56 to include the yearly different carnival season
"dist_terremoto_manabi": [-1, 28], # -1 to distinguish from 0, when the earthquake stroke
"dist_mundial_ecuador": [-7, 7]
}
def clip_distances(X: pd.DataFrame):
for col in X:
if col not in DISTANCE_CLIPPINGS:
raise ValueError(f"Provide clip values for {col} in DISTANCE_CLIPPINGS")
mins = [DISTANCE_CLIPPINGS.get(col)[0] for col in X]
maxs = [DISTANCE_CLIPPINGS.get(col)[1] for col in X]
return np.clip(X, mins, maxs)
def clipped_feature_names(transformer, input_features):
return [f"{f}_clipped" for f in input_features]
clip_transformer = FunctionTransformer(
clip_distances,
feature_names_out=clipped_feature_names
)
Now we finally define the pipeline for our first model and fit it.
from sklearn.compose import ColumnTransformer, make_column_selector, TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn import set_config
from lightgbm import LGBMRegressor
set_config(transform_output="pandas") # enables name-preserving DataFrame flow
# precision with float32 is enough/even beneficial
def cast2float32(X: pd.DataFrame) -> pd.DataFrame:
return X.astype("float32")
to_float32 = FunctionTransformer(cast2float32, feature_names_out="one-to-one")
# exclude target from floats to prevent using it as a feature
def selector_no_sales(df):
cols = make_column_selector(dtype_include=float)(df)
return [col for col in cols if col != "sales"]
ordinal_encoder = OrdinalEncoder(dtype=int, handle_unknown="use_encoded_value", unknown_value=-1)
cat_cols = (["day", "month"]
+ cat_holiday_cols
+ cat_basic_cols
)
fourier_cols = ["weekday", "day_of_year"]
pre = ColumnTransformer(
[("cat", ordinal_encoder, cat_cols),
("fourier", fourier_transformer, fourier_cols),
("reals", to_float32, selector_no_sales),
("dist_clip", clip_transformer, dist_holiday_cols),
("drop_sales", "drop", ["sales"])
],
remainder="passthrough",
verbose_feature_names_out=False,
)
num_series_sample = X[["family", "store_nbr"]].drop_duplicates().shape[0]
target_pipe = make_pipeline(
to_float32,
log1p_transformer,
GroupStandardScaler(num_groups=num_series_sample, inverse_sorted_by_group=True)
)
simple_lgbm = TransformedTargetRegressor(
LGBMRegressor(learning_rate=0.1, # default
num_leaves=31, # default
random_state=42,
force_row_wise=True,
verbosity=0),
transformer=target_pipe,
check_inverse=False #essential: avoid subset transform that breaks group-size assumption
)
simple_lgbm_pipe = Pipeline([
("pre", pre),
("simple_lgbm", simple_lgbm)
])
simple_lgbm_pipe.fit(X, y, simple_lgbm__categorical_feature=cat_cols)
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('cat',
OrdinalEncoder(dtype=<class 'int'>,
handle_unknown='use_encoded_value',
unknown_value=-1),
['day', 'month',
'is_new_year', 'local_type',
'regional_type',
'national_type',
'is_leap_year', 'store_nbr',
'family', 'city', 'type',
'cluster']),
('fourier',
FourierTransformer(periods_har...
FunctionTransformer(feature_names_out='one-to-one',
func=<function cast2float32 at 0x3b427fbe0>)),
('functiontransformer-2',
FunctionTransformer(check_inverse=False,
feature_names_out=<function log1p_feature_names at 0x36ca09120>,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>)),
('groupstandardscaler',
GroupStandardScaler(inverse_sorted_by_group=True,
num_groups=309))])))])
Before calculating the first prediction we need to choose an evaluation metric. Most time series in the dataset consist largely of very low daily sales (often 0 – 1 units) with only a few high-volume outliers. For such distributions, metrics that emphasise relative accuracy—such as the Root Mean Squared Logarithmic Error (RMSLE)—are more informative than purely absolute ones like RMSE, which are dominated by large series and rare peaks. To complement RMSLE, the Mean Absolute Error (MAE) is reported as it reflects average deviations in business-relevant units and remains robust against moderate outliers.
Since LightGBM’s default loss is mean squared error, training on log1p-transformed targets (see Analyzing the Target section for details) effectively minimises a loss equivalent to the RMSLE. Initial experiments confirmed that this configuration achieves the best balance between RMSLE and MAE.
When we apply the evaluation metrics on our first fit, we get:
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, root_mean_squared_log_error, median_absolute_error
pred = simple_lgbm_pipe.predict(X)
def get_metrics(true: np.ndarray|pd.Series, pred: np.ndarray|pd.Series, use_all: bool=False, print_logs: bool = True):
if np.any(pred < 0): # sanity check for rmsle which uses log1p
min_pred = np.min(pred)
pred = np.clip(pred, 0, None)
if print_logs:
print(f"Negative Predictions detected with minimum {min_pred: .2f} and clipped to 0")
if use_all:
metrics = root_mean_squared_log_error, mean_absolute_error, median_absolute_error, root_mean_squared_error
metric_keys = "RMSLE", "MAE", "MedAE", "RMSE"
else:
metrics = root_mean_squared_log_error, mean_absolute_error
metric_keys = "RMSLE", "MAE"
metric_dict = {key: np.round(metric(true, pred), 4).item() for key, metric in zip(metric_keys, metrics)}
return metric_dict
get_metrics(y, pred, use_all=False)
Negative Predictions detected with minimum -1.00 and clipped to 0
{'RMSLE': 0.5011, 'MAE': 67.6852}
Still, it is hard to decide whether these are good values. Since the data shows a very strong weekly seasonality, we define a naive baseline by simply using the last 14 days of each series to forecast the next 14 days.
def lag14_baseline(y, n_series):
a = np.asarray(y).reshape(n_series, -1) # shape: (n_series, T)
y_true = a[:, 14:].flatten() # from day 15 on, days t
y_pred = a[:, :-14].flatten() # predict with value at t-14
return get_metrics(y_true, y_pred)
lag14_baseline(y, num_series_sample)
{'RMSLE': 0.8168, 'MAE': 78.6676}
While both metrics show substantially better results than the baseline, using the model's predictions improved the RMSLE much more than the MAE. This is probably because MAE is dominated by high-volume series with larger absolute errors, whereas RMSLE captures relative improvements more evenly across all series.
Still, it is even more interesting to see how our model performs on data unseen during training, i.e., in a true forecast scenario. For this, we use blocked time-series cross-validation: we train the model on expanding windows with equally spaced end points (each covering at least 85% of the training window) and predict the following 14 days, which the model has not seen during training.
def block_tscv_gen(X: pd.DataFrame | np.ndarray, num_series, test_size, n_splits,
cv_start_frac: float = train_sub_window_ratio):
n = X.shape[0]
assert n % num_series == 0, "Number of instances not divisible by number of series"
series_len = n // num_series
cv_start = int(series_len * cv_start_frac)
cv_window = series_len - cv_start
basic_split_len = cv_window//n_splits
assert basic_split_len >= test_size, "Test size too large for given number of splits."
r = cv_window % n_splits
split_lengths = [(basic_split_len + 1) if i < r else basic_split_len for i in range(n_splits)]
last_idx = cv_start + np.cumsum(split_lengths) # the last index of each fold
idx = np.arange(n).reshape(num_series, series_len) # array with all indices aligned per series
for i in last_idx:
fold_idx = idx[:, np.arange(i)]
train_idx = fold_idx[:, :-test_size].flatten()
test_idx = fold_idx[:, -test_size:].flatten()
yield train_idx, test_idx
ts_cv = list(block_tscv_gen(X, num_series_sample, 14, 5)) # list makes it reusable (in cross_val_score and the baseline)
ts_cv
[(array([ 0, 1, 2, ..., 517048, 517049, 517050],
shape=(450831,)),
array([ 1459, 1460, 1461, ..., 517062, 517063, 517064], shape=(4326,))),
(array([ 0, 1, 2, ..., 517099, 517100, 517101],
shape=(466590,)),
array([ 1510, 1511, 1512, ..., 517113, 517114, 517115], shape=(4326,))),
(array([ 0, 1, 2, ..., 517149, 517150, 517151],
shape=(482040,)),
array([ 1560, 1561, 1562, ..., 517163, 517164, 517165], shape=(4326,))),
(array([ 0, 1, 2, ..., 517199, 517200, 517201],
shape=(497490,)),
array([ 1610, 1611, 1612, ..., 517213, 517214, 517215], shape=(4326,))),
(array([ 0, 1, 2, ..., 517249, 517250, 517251],
shape=(512940,)),
array([ 1660, 1661, 1662, ..., 517263, 517264, 517265], shape=(4326,)))]
from sklearn.model_selection import cross_val_score
lgbm_rmsles = -cross_val_score(simple_lgbm_pipe, X, y, scoring="neg_root_mean_squared_log_error", cv=ts_cv,
params={"simple_lgbm__categorical_feature": cat_cols})
lgbm_rmsles
array([0.79127413, 0.50816762, 0.5532748 , 0.52739194, 0.44380139])
Except for the first fold, which lies at the turn of the years 2016 and 2017, the metrics are rather consistent. We compare these results with baseline values, which are calculated using the values from the last 14 days before the out-of-sample time steps to predict exactly those same periods.
def cv_lag14_baselines(y: pd.Series | np.ndarray, cv: list) -> np.ndarray:
if hasattr(y, "to_numpy"):
y = y.to_numpy()
rmsles = [root_mean_squared_log_error(y[fold[1]], y[fold[1]-14]) for fold in cv]
return np.array(rmsles)
cv_14days_baseline = cv_lag14_baselines(y, ts_cv)
cv_14days_baseline
array([1.90820022, 0.63408609, 0.73017002, 0.49844971, 0.56071666])
The first fold's baseline is much worse than the second-worst baseline, which occurs in the third fold. The first fold lies within a regime shift at the very end of December, meaning that values from December, including the very high pre-Christmas values, are used to forecast the much lower values of January. We therefore need a baseline which is harder to beat in this case: using the values from exactly one year ago, which avoids problems caused by regime shifts and compares holidays with the same holiday one year earlier (while, of course, introducing a slight bias due to weekday shifts).
def recreate_dates(df: pd.DataFrame, day_col: str = "day", month_col: str = "month", year_col: str = "year") -> np.ndarray:
month_dict = dict(zip(months, range(1, 13)))
month_num = df[month_col].map(month_dict)
return pd.to_datetime(dict(year=df[year_col], month=month_num, day=df[day_col])).to_numpy(dtype="datetime64[D]")
def cv_last_year_baselines(X: pd.DataFrame, y: pd.Series|np.ndarray, cv) -> np.ndarray:
"""Assumes X and y ordered by date."""
y = np.array(y)
rmsles = []
for fold in cv:
original_dates = recreate_dates(X.iloc[fold[1]])
lag365_dates = original_dates - np.timedelta64(365, "D")
days, lag365_days = [(dates - dates.astype("datetime64[M]")).astype(int) + 1 # extracts days of month
for dates in [original_dates, lag365_dates]]
offsets = np.where(days==lag365_days, 365, 366) # if day of month differs we must have crossed Feb 29 -> 366
offset_idx = fold[1]-offsets
if (offset_idx < 0).any():
raise ValueError("Either the time series are too short for comparing with dates one year ago or the cv indices are wrong")
rmsles.append(root_mean_squared_log_error(y[fold[1]], y[offset_idx]))
return np.array(rmsles)
cv_last_year_baseline = cv_last_year_baselines(X, y, ts_cv)
cv_last_year_baseline
array([0.67756349, 0.73725381, 0.81803906, 0.92362552, 0.97257277])
The first fold's baseline value increased substantially, while the other folds decreased. Comparing both baselines, we notice that they both perform poorly on the third fold. This is probably caused by Easter/Good Friday falling into the third fold's validation window. Neither the 14-day baseline nor the last-year baseline can capture such holiday effects, which affects all naive forecasting approaches.
More generally, the inconsistent baseline performance across folds indicates substantial non-stationarity: within only half a year, different folds represent distinct regimes with varying levels of predictability. This confirms that the dataset is noisy and regime-dependent rather than smoothly seasonal.
However, the model's results are mostly better than those of the baselines. To be more precise, we calculate the model's improvement over the baselines in percentage terms:
def cv_improvement(cv_metric_values: np.ndarray, cv_baselines: np.ndarray):
return np.round((1 - (cv_metric_values / cv_baselines)) * 100, 1)
cv_lag14_improvements = cv_improvement(lgbm_rmsles, cv_14days_baseline)
cv_last_year_improvements = cv_improvement(lgbm_rmsles, cv_last_year_baseline)
print("Last 14 days baseline:", ", ".join(cv_lag14_improvements.astype(str)),
"\nLast year baseline: ", ", ".join(cv_last_year_improvements.astype(str)))
Last 14 days baseline: 58.5, 19.9, 24.2, -5.8, 20.9 
Last year baseline:    -16.8, 31.1, 32.4, 42.9, 54.4
This shows how important strong baselines are. While lgbm_rmsles appeared relatively steady across folds, comparison with the corresponding baselines reveals that the model especially struggles with the first fold (when compared to the last-year baseline) but also with the fourth fold (compared to the last-14-days baseline). Even with only five folds, it is still worth examining some descriptive statistics:
pd.concat(
[pd.Series(cv_lag14_improvements).describe(),
pd.Series(cv_last_year_improvements).describe()], axis=1
).rename(columns={0: "Last 14 days", 1: "Last year"})
| Last 14 days | Last year | |
|---|---|---|
| count | 5.000000 | 5.000000 |
| mean | 23.540000 | 28.800000 |
| std | 22.933011 | 27.165143 |
| min | -5.800000 | -16.800000 |
| 25% | 19.900000 | 31.100000 |
| 50% | 20.900000 | 32.400000 |
| 75% | 24.200000 | 42.900000 |
| max | 58.500000 | 54.400000 |
With these average improvements, this simple model would already be worth putting into production. However, the fold-to-fold variation is quite high for both baselines, and the model did not beat every baseline in every fold.
It is now time to increase the model's complexity by adding lag features and exponential moving averages to the LightGBM model. We will also need to add a recursive prediction step to prevent future leakage.
Since the best predictor of the future is the past, we want to include features based on past values of the time series, such as lags and rolling means to inject temporal memory explicitly. This requires not only new functions and corresponding transformers but also a custom LightGBM model, because when predicting more than one time step ahead, we no longer know the true past values.
Therefore, we need to add the forecast from the previous time step back into the dataset and recompute the lagged features for the next step. These are then combined with the features known in advance (such as weekday, month, and holidays), and the process is repeated recursively.
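The recursive idea itself can be sketched independently of the full implementation below. In this minimal sketch, `recursive_forecast` and `one_step_model` are illustrative names (not part of the pipeline), and the stand-in model simply averages its lag features:

```python
import numpy as np

def recursive_forecast(history, one_step_model, horizon, lags=(1, 7)):
    """Feed each prediction back as the newest observation, then recompute lag features."""
    history = list(history)
    preds = []
    for _ in range(horizon):
        feats = np.array([history[-lag] for lag in lags])  # lag features from the (partly predicted) past
        y_hat = one_step_model(feats)
        preds.append(y_hat)
        history.append(y_hat)  # the prediction becomes the newest "observation"
    return np.array(preds)

# toy model: predict the mean of the lag features
result = recursive_forecast([1, 2, 3, 4, 5, 6, 7, 8], lambda f: f.mean(), horizon=2)
print(result)  # [5. 4.]
```

The actual implementation later in this section follows the same loop but recomputes lags, rolling means, and EWMAs for all series simultaneously.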
We start by defining a function that creates lagged values for each time series. I also added a flag, inference_mode, which indicates whether we only need the last lagged values of a series—as during prediction—or a shifted version of the full series. Example arrays, which were also useful during debugging, illustrate the functions' behaviour.
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
def make_lags(X: pd.DataFrame | pd.Series | np.ndarray, num_series: int, lags: list[int],
inference_mode: bool = False) -> np.ndarray:
"""Computes lagged values of contiguous time series"""
if lags == []:
return None # in case we do not want to include any lag features
if hasattr(X, "to_numpy"):
X = np.asarray(X)
X_re = X.reshape(num_series, -1)
series_len = X_re.shape[1]
lags_arr = np.array(lags)
if np.any(lags_arr >= series_len):
raise ValueError(f"At least one lag value ≥ series length {series_len}")
lagged_features = []
first_col = X_re[:, [0]]
if not inference_mode:
for lag in lags:
shifted = X_re[:, :-lag]
backfill = np.repeat(first_col, lag, axis=1) # repeat first value of each series lag times
lagged_X = np.concatenate([backfill, shifted], axis=1).flatten()
lagged_features.append(lagged_X)
return np.column_stack(lagged_features).astype("float32")
else:
# vectorized lag extraction
return X_re[:, -(lags_arr + 1)].reshape(num_series, -1, len(lags)).astype("float32") # beneficial shape for forecasts
make_lags(arr, num_series=2, lags=[1, 3])
array([[0., 0.],
[0., 0.],
[1., 0.],
[2., 0.],
[3., 1.],
[5., 5.],
[5., 5.],
[6., 5.],
[7., 5.],
[8., 6.]], dtype=float32)
# only last values of each series as in prediction
make_lags(arr, num_series=2, lags=[1, 3], inference_mode=True)
array([[[3., 1.]],
[[8., 6.]]], dtype=float32)
Next, we define a function for seasonal rolling features: it calculates the average of several lags that are multiples of seasonal cycles (e.g. average of last 7, 14, 21, 28 days for weekly seasonality).
Including seasonal standard deviations slightly worsened RMSLE and increased dimensionality, so I reverted to using only seasonal means.
def make_seasonal_roll_means(X: pd.DataFrame | pd.Series | np.ndarray, num_series: int, seas_roll_dict: dict[list],
inference_mode=False):
"""Computes rolling means over seasonal lag windows, e.g., 7, 14, 21, 28 days, of contiguous time series"""
if hasattr(X, "to_numpy"):
X = np.asarray(X)
X_re = X.reshape(num_series, -1)
series_len = X_re.shape[1]
if any(lag >= series_len for lags in seas_roll_dict.values() for lag in lags):
raise ValueError(f"At least one lag value ≥ series length {series_len}")
seas_roll_features = []
if not inference_mode:
for key in seas_roll_dict:
lags = seas_roll_dict[key]
lagged_Xs = []
first_col = X_re[:, [0]]
for lag in lags:
shifted = X_re[:, :-lag]
backfill = np.repeat(first_col, lag, axis=1) # repeat first value of each series lag times
lagged_Xs.append(
np.concatenate([backfill, shifted], axis=1).flatten()
)
seas_roll_features.append(np.column_stack(lagged_Xs).mean(axis=1)) # mean of shifted series as adjacent cols
return np.column_stack(seas_roll_features).astype("float32")
else:
for key in seas_roll_dict:
lags_arr = np.array([-(lag + 1) for lag in seas_roll_dict[key]])
seas_roll_features.append(X_re[:, lags_arr].mean(axis=1, keepdims=True))
return (
np.column_stack(seas_roll_features)
.reshape(num_series, 1, len(seas_roll_dict))
.astype("float32")
)
long_arr = np.arange(20)
dict_example = {"2s": [2, 4, 6], "3s": [3, 6, 9]}
make_seasonal_roll_means(long_arr, num_series=2, seas_roll_dict=dict_example, inference_mode=False)
array([[ 0. , 0. ],
[ 0. , 0. ],
[ 0. , 0. ],
[ 0.33333334, 0. ],
[ 0.6666667 , 0.33333334],
[ 1.3333334 , 0.6666667 ],
[ 2. , 1. ],
[ 3. , 1.6666666 ],
[ 4. , 2.3333333 ],
[ 5. , 3. ],
[10. , 10. ],
[10. , 10. ],
[10. , 10. ],
[10.333333 , 10. ],
[10.666667 , 10.333333 ],
[11.333333 , 10.666667 ],
[12. , 11. ],
[13. , 11.666667 ],
[14. , 12.333333 ],
[15. , 13. ]], dtype=float32)
# only last values of each series as in prediction
make_seasonal_roll_means(long_arr, 2, dict_example, inference_mode=True)
array([[[ 5., 3.]],
[[15., 13.]]], dtype=float32)
We include exponentially weighted moving averages (EWMAs) to represent recent levels while emphasizing more recent values. For fast computation, we use scipy.signal.lfilter, since the recursive EWMA update
$$\mathrm{ewma}_t = \alpha y_t + (1 - \alpha)\,\mathrm{ewma}_{t-1}$$
can be expressed as a linear filter
$$a_0y_n = b_0x_n - a_1y_{n-1}$$
where $y$ denotes the filter output (the EWMA) and $x$ the input series $y_t$. Choosing $b_0 = \alpha$, $a_0 = 1$, and $a_1 = -(1 - \alpha)$ yields the desired recursive relationship.
Instead of specifying $\alpha$ directly, we parameterize by a span, i.e., roughly the number of time steps after which a value's effective weight becomes negligible (close to zero). The following function internally computes the corresponding $\alpha = 2/(1 + \mathrm{span})$ for the spans provided, so we only need to specify the span — a more intuitive parameter for humans to reason about.
During inference (when inference_mode = True), we could compute the most recent values in closed form using exponentially decaying weights. However, we instead exploit the recursive form, as the last target and EWMA values for each series are already known from training, greatly reducing computational complexity.
from scipy.signal import lfilter
def make_ewmas(X: pd.DataFrame | pd.Series | np.ndarray, num_series: int, spans: list[int], inference_mode: bool = False,
last_ewma_states: np.ndarray = None) -> np.ndarray:
"""
Computes exponential weighted moving averages (EWMAs) of contiguous time series using the provided spans.
When `inference_mode=True`, a 3D-array of shape (num_series, 1, len(spans)) must be passed as `last_ewma_states`.
This array represents the most recent EWMA values (at time step t–1) for each series, which are used to
compute the updated EWMAs at time step t. The resulting EWMAs can then serve as features for predicting
values at time step t+1 during recursive inference.
"""
if hasattr(X, "to_numpy"):
X = np.asarray(X, dtype=float) # lfilter expects floats
X_re = X.reshape(num_series, -1)
alphas = 2 / (1 + np.array(spans)) # 1D array with shape(len(spans),)
n_alphas = alphas.shape[0]
ewmas_shifted = np.array(
[lfilter([alpha], [1, -(1 - alpha)], X_re[:, :-1]) for alpha in alphas] # shift to avoid future leakage
) # shape (alphas, series, timesteps)
if not inference_mode:
backfill = ewmas_shifted[:, :, [0]] # align to original shape
ewmas_filled = np.concatenate([backfill, ewmas_shifted], axis=2)
return (
ewmas_filled
.transpose(1, 2, 0) # series -> groups, timesteps -> rows, alphas -> cols
.reshape(-1, n_alphas)
.astype("float32")
)
else:
if last_ewma_states is None:
raise ValueError("Argument 'last_ewma_states' cannot be None when inference_mode=True.")
if not (last_ewma_states.ndim == 3 and last_ewma_states.shape == (num_series, 1, len(spans))):
raise ValueError(f"'last_ewma_states' must be a 3D array with shape "
f"({num_series}, 1, {len(spans)}), but got {last_ewma_states.shape}.")
X_last = X_re[:, [-1]].reshape(num_series, 1, 1)
ewma_states = alphas * X_last + (1 - alphas) * last_ewma_states
return ewma_states.reshape(num_series, 1, n_alphas).astype("float32")
example_ewmas = make_ewmas(X=arr, num_series=2, spans=[3, 7, 31])
example_ewmas
array([[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0.5 , 0.25 , 0.0625 ],
[1.25 , 0.6875 , 0.18359375],
[2.125 , 1.265625 , 0.35961914],
[2.5 , 1.25 , 0.3125 ],
[2.5 , 1.25 , 0.3125 ],
[4.25 , 2.4375 , 0.66796875],
[5.625 , 3.578125 , 1.0637207 ],
[6.8125 , 4.6835938 , 1.4972382 ]], dtype=float32)
Using the last state and the actual last value of each series allows us to calculate the next value of the recursion out of sample:
# [:, [-1], :] is last state of each series over all ewma features
make_ewmas(X=arr, num_series=2, spans=[3, 7, 31], inference_mode=True,
last_ewma_states=example_ewmas.reshape(2, -1, 3)[:, [-1], :])
array([[[3.0625 , 1.9492188 , 0.58714294]],
[[7.90625 , 5.7626953 , 1.9661608 ]]], dtype=float32)
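As a quick sanity check (not part of the original pipeline), the lfilter coefficient mapping can be verified against an explicit Python loop implementing the recursion with zero initial state:

```python
import numpy as np
from scipy.signal import lfilter

span = 7
alpha = 2 / (1 + span)          # span-to-alpha conversion, as in make_ewmas
x = np.arange(10, dtype=float)

# explicit recursion: ewma_t = alpha * x_t + (1 - alpha) * ewma_{t-1}, with ewma_{-1} = 0
ewma, out = 0.0, []
for val in x:
    ewma = alpha * val + (1 - alpha) * ewma
    out.append(ewma)

# lfilter with b = [alpha], a = [1, -(1 - alpha)] and zero initial conditions
filtered = lfilter([alpha], [1, -(1 - alpha)], x)
```

Both computations agree element-wise; lfilter simply runs the same recursion in compiled code, which is what makes the feature computation fast across many series.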
Now we can define the lags, the EWMA spans, and the dictionary for the seasonal rolling means, for which a small helper function makes the input more convenient.
# often beneficial to use lags slightly exceeding one seasonal cycle, 366 for leap years
lags = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 21, 28, 365, 366]
spans = [7, 14, 30, 365]
def create_seas_roll_dict(periods_dict: dict[list[int]] = {"weeks": [5], "months": [3], "years": [3]}) -> dict:
"""Returns a dictionary in the form that make_seasonal_roll_means requires."""
translate_periods_to_lags = {"weeks": 7, "months": 30, "years": 365}
seas_roll_dict = {
f"roll{n}{period}": [translate_periods_to_lags[period] * i for i in range(1, n + 1)]
for period, ns in periods_dict.items()
for n in ns
}
return seas_roll_dict
seas_roll_dict = create_seas_roll_dict()
seas_roll_dict # same as original
{'roll5weeks': [7, 14, 21, 28, 35],
'roll3months': [30, 60, 90],
'roll3years': [365, 730, 1095]}
With all this in place, we can now implement a custom transformer class that efficiently adds these lagged features. When the series length matches the defined horizon—i.e., when transforming out-of-sample data—the transformer simply returns NaNs in the correct shape, as the recursive model will compute these features itself from the last predictions using the inference_mode flag of the underlying lagged features functions.
This ensures consistent column ordering between model fitting and inference and prevents shape errors in both LightGBM and the underlying lagged-feature functions, which verify that the available window is at least as long as the maximum lag.
class LaggedFeaturesTransformer(BaseEstimator, TransformerMixin):
"""Uses lagged feature making functions to generate said features."""
def __init__(self, num_series: int = None, horizon: int = 14, lags: list[int] = lags,
seasonal_roll_dict: dict[list] = seas_roll_dict, ewma_spans: list[int] = spans):
self.num_series = num_series
self.horizon = horizon
self.lags = lags
self.seasonal_roll_dict = seasonal_roll_dict
self.ewma_spans = ewma_spans
def fit(self, X, y=None):
self.n_features_in_ = X.shape[1] if X.ndim == 2 else 1
self.feature_names_in_ = getattr(X, 'columns', None)
self.n_lags_ = len(self.lags)
self.n_seas_rolls_ = len(self.seasonal_roll_dict)
self.n_ewmas_ = len(self.ewma_spans)
self.n_lagged_features_ = self.n_lags_ + self.n_seas_rolls_ + self.n_ewmas_
return self
def transform(self, X, y=None):
"""
When X is only as long as the forecast horizon (as in out-of-sample prediction), the computation is
skipped entirely and placeholders filled with NaNs are returned, preventing the underlying lagged-feature
functions from raising errors because the available window is shorter than some of the lags.
The filling also ensures consistent column ordering.
"""
n_steps = len(X) // self.num_series
if n_steps == self.horizon:
return np.full((X.shape[0], self.n_lagged_features_), np.nan, dtype=float)
else:
lag_block = make_lags(X, self.num_series, self.lags)
roll_block = make_seasonal_roll_means(X, self.num_series, self.seasonal_roll_dict)
ewma_block = make_ewmas(X, self.num_series, self.ewma_spans)
blocks = [lag_block, roll_block, ewma_block]
return np.concatenate([b for b in blocks if b is not None], axis=1)
def get_feature_names_out(self, input_features=None):
if input_features is None:
input_features = self.feature_names_in_
lag_names = [f"{inp}_lag{lag}" for inp in input_features for lag in self.lags]
roll_names = [f"{inp}_{roll}" for inp in input_features for roll in self.seasonal_roll_dict]
ewma_names = [f"{inp}_ewma{span}" for inp in input_features for span in self.ewma_spans]
return lag_names + roll_names + ewma_names
lagged_features_transformer = LaggedFeaturesTransformer(num_series=num_series_sample)
pd.DataFrame(data=lagged_features_transformer.transform(X["sales"]), # transform already returns an ndarray
columns=lagged_features_transformer.get_feature_names_out(["sales"]))
| sales_lag1 | sales_lag2 | sales_lag3 | sales_lag4 | sales_lag5 | sales_lag6 | sales_lag7 | sales_lag8 | sales_lag9 | sales_lag10 | ... | sales_lag28 | sales_lag365 | sales_lag366 | sales_roll5weeks | sales_roll3months | sales_roll3years | sales_ewma7 | sales_ewma14 | sales_ewma30 | sales_ewma365 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 0.000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3 | 0.000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 4 | 0.000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 517261 | 8.642 | 13.915 | 17.378 | 16.059999 | 17.868000 | 31.424999 | 17.483999 | 9.592000 | 13.489000 | 10.200000 | ... | 22.438000 | 9.002000 | 9.462000 | 17.289801 | 21.361000 | 9.270000 | 14.784315 | 15.797618 | 16.651100 | 17.659334 |
| 517262 | 11.000 | 8.642 | 13.915 | 17.378000 | 16.059999 | 17.868000 | 31.424999 | 17.483999 | 9.592000 | 13.489000 | ... | 33.608002 | 12.958000 | 9.002000 | 27.110399 | 15.595333 | 10.239000 | 13.838236 | 15.157935 | 16.286514 | 17.622944 |
| 517263 | 21.916 | 11.000 | 8.642 | 13.915000 | 17.378000 | 16.059999 | 17.868000 | 31.424999 | 17.483999 | 9.592000 | ... | 33.925999 | 32.563999 | 12.958000 | 22.188200 | 31.150333 | 17.778666 | 15.857677 | 16.059011 | 16.649706 | 17.646404 |
| 517264 | 19.909 | 21.916 | 11.000 | 8.642000 | 13.915000 | 17.378000 | 16.059999 | 17.868000 | 31.424999 | 17.483999 | ... | 17.334999 | 48.562000 | 32.563999 | 13.320000 | 29.148333 | 31.948000 | 16.870508 | 16.572342 | 16.859983 | 17.658768 |
| 517265 | 12.000 | 19.909 | 21.916 | 11.000000 | 8.642000 | 13.915000 | 17.378000 | 16.059999 | 17.868000 | 31.424999 | ... | 5.097000 | 15.094000 | 48.562000 | 10.555600 | 20.350334 | 19.200666 | 15.652881 | 15.962697 | 16.546436 | 17.627846 |
517266 rows × 22 columns
Definition of Custom Recursive Regressor Class ↑¶
For recursive forecasting, we need to define a custom regressor. Since scikit-learn's estimators cannot handle **kwargs, subclassing LGBMRegressor directly would require explicitly listing every parameter we want to support. This quickly becomes tedious and error-prone, which is why I prefer to compose the custom model class with a base_estimator which defaults to an instance of LGBMRegressor.
We can still set LightGBM parameters via recursive_lgbm__base_estimator__parameter_name for potential tuning without having to list every parameter explicitly. We could even use any other sklearn regressor as the base_estimator, since the logic remains the same—although there is currently no need for that, it could be interesting for experimenting with ensembles.
The custom regressor’s predict method uses LightGBM’s predict function for one-step-ahead forecasting, which is then applied recursively. After each prediction, the new value is appended to the target so that the model can recompute the lagged features. The required specifications for this update are derived from the fitted feature names and passed to the lagged-feature functions together with the updated target values and, in the case of EWMAs, the most recent EWMA states.
The entire computation is fully vectorized by taking advantage of the fact that all series share the same length. This allows the data to be reshaped into a 3D array representing series, timesteps, and features, which removes the need for time-consuming pandas sorting and grouping operations and enables real-time recursive predictions. This design makes the regressor both efficient and scalable, allowing it to handle many parallel time series with minimal overhead.
The 3D reshaping concept was developed specifically for this project to efficiently handle multiple series in parallel, where the data can be imagined as a cube of matrices—each matrix representing one time series with timesteps as rows and features as columns. The recursion then slices the cube’s front face to access the most recent timestep across all series simultaneously.
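The cube idea can be demonstrated with a small NumPy example; the shapes (2 series, 3 timesteps, 2 features) are chosen purely for illustration:

```python
import numpy as np

# 2 series x 3 timesteps x 2 features, stacked series-by-series as in the training frame
flat = np.arange(12).reshape(6, 2)     # 6 rows (2 series * 3 timesteps), 2 feature columns
cube = flat.reshape(2, 3, 2)           # (n_series, n_steps, n_features)

# slice the most recent timestep of every series at once, keeping 3D shape
last_step = cube[:, [-1], :]
print(last_step.shape)  # (2, 1, 2)
```

No sorting or groupby is needed: a single reshape plus a slice yields the latest observation of every series, which is exactly the access pattern the recursion performs at each step.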
from sklearn.base import RegressorMixin, clone
class RecursiveRegressor(BaseEstimator, RegressorMixin):
"""
Custom recursive regressor for multi-series forecasting.
This class wraps a base estimator (default: LightGBM) and performs recursive multi-step forecasting by reusing
each new prediction to update lagged features and EWMAs. The computation is fully vectorized across all series
for efficient parallel recursion.
"""
def __init__(self, base_estimator=None, num_series: int=None, use_recursive=True):
self.base_estimator = base_estimator
self.num_series = num_series
self.use_recursive = use_recursive
def fit(self, X: pd.DataFrame, y: pd.DataFrame | np.ndarray, **fit_params):
"""
Fits the base estimator and extracts the metadata required
for recursive forecasting.
Parameters
----------
X : pd.DataFrame
Training features containing lagged, rolling, and EWMA features.
y : pd.DataFrame or np.ndarray
Target values aligned with X.
**fit_params : dict
Additional parameters passed to the base estimator's fit method.
"""
if not isinstance(X, pd.DataFrame):
raise TypeError("RecursiveRegressor requires a pandas DataFrame for feature-name extraction.")
self.feature_names_in_ = X.columns # must be explicit as None would be fatal
self.n_features_in_ = X.shape[1]
# fit
est = clone(self.base_estimator) if self.base_estimator is not None else LGBMRegressor()
self.est_ = est
self.est_.fit(X, y, **fit_params)
self.model_class_name = type(self.est_).__name__
# extract idx and arguments for lagged features of X col names
self.lagged_idx_args_funcs_, self.lagged_start_idx_ = self._get_lagged_start_idx_args_funcs(X)
X = np.asarray(X)
y = np.asarray(y)
# store last max_lag values of y per series for lagged feature creation in predict()
max_lag = max(self.lagged_idx_args_funcs_["lag"]["args"]
+ [item for value in self.lagged_idx_args_funcs_["roll"]["args"].values()
for item in value])
self.y_tail_init_ = y.reshape(self.num_series, -1)[:, -(max_lag + 1):] # 2D (n_series, n_steps)
# store last ewmas for recursive computation of next ewma_state in predict()
ewma_idx = self.lagged_idx_args_funcs_["ewma"]["idx"]
self.ewma_states_init_ = (
X[:, ewma_idx]
.reshape(self.num_series, -1, len(ewma_idx)) # 3D with shape(n_series, n_steps, n_ewma_features)
[:, [-1], :] # only last ewmas of each series
)
return self
def predict(self, X: pd.DataFrame | np.ndarray) -> np.ndarray:
"""
Performs one-step or recursive multi-step forecasting.
Parameters
----------
X : pd.DataFrame or np.ndarray
Input features for prediction. Can include known future features
(e.g., exogenous variables) but excludes lagged features in
recursive mode.
Returns
-------
np.ndarray
Predicted target values for each series and timestep.
"""
X = np.asarray(X)
n_features = X.shape[1]
if not self.use_recursive:
if n_features != self.n_features_in_:
raise ValueError(f"Number of predict features ({n_features}) does not equal "
f"number of fitted features ({self.n_features_in_})")
return self._one_step_predict(X)
else: # use_recursive==True
# check correct feature number
n_lagged_features = np.sum([len(spec["args"]) for spec in self.lagged_idx_args_funcs_.values()])
idx_lagged_features = np.array([idx for spec in self.lagged_idx_args_funcs_.values() for idx in spec["idx"]])
n_non_lagged_features = self.n_features_in_ - n_lagged_features
# ensure only known features are used
if n_features == self.n_features_in_:
X = X[:, ~np.isin(np.arange(n_features), idx_lagged_features)] # keep only known features
elif n_features != n_non_lagged_features: # n_features == n_non_lagged_features is fine: only known features passed
raise ValueError(
f"Number of predict features ({n_features}) equals neither the number of fitted "
f"features ({self.n_features_in_}) nor the number of known features ({n_non_lagged_features})"
)
X_re = X.reshape(self.num_series, -1, n_non_lagged_features) # 3D: (n_series, n_steps, n_features)
y_tail = self.y_tail_init_.copy()
last_ewma_states = self.ewma_states_init_.copy()
horizon = X_re.shape[1]
for step in range(horizon):
# update new_feat and last_ewma_states
new_feat, last_ewma_states = self._update_lagged_features(y_tail, last_ewma_states)
known_feat = X_re[:, [step], :]
# take care of correct position of new_feat (either at the beginning or end)
if self.lagged_start_idx_ == 0:
features = np.concatenate(new_feat + [known_feat], axis=2)
else:
features = np.concatenate([known_feat] + new_feat, axis=2)
pred_feat = features.reshape(self.num_series, -1) # flatten single steps to 2D (n_series, n_features)
pred = self._one_step_predict(pred_feat).reshape(-1, 1) # 2D: (n_series, 1)
y_tail = np.concatenate([y_tail, pred], axis=1) # 2D: add pred as newest time step in y_tail
return y_tail[:, -horizon:].reshape(-1, 1) # last horizon steps are predictions
def _one_step_predict(self, X: pd.DataFrame | np.ndarray) -> np.ndarray:
"""
The estimator's original predict method to forecast one step ahead.
LightGBM always internally stores feature names even when fitted on an array ("Column0", "Column1", etc.).
Therefore the original DataFrame is used for fitting and its structure is recreated here with the original
feature names, as this adds a bit of extra safety and the conversion is nearly free.
While the class takes care of most of the ordering itself, silencing the resulting warning about missing
feature names when predicting on an array would be another possibility.
"""
if isinstance(X, np.ndarray):
X = pd.DataFrame(X, columns=self.feature_names_in_)
return self.est_.predict(X)
def _update_lagged_features(self, y_tail: np.ndarray, last_ewma_states: np.ndarray) -> tuple[list, np.ndarray]:
"""
Recomputes lagged features using the latest predictions.
This method updates lagged, rolling, and EWMA-based features at each recursive step. It uses the pre-stored
function references and arguments obtained during fitting, and injects the current state (e.g., latest EWMA
values) as needed.
Parameters
----------
y_tail : np.ndarray
Array containing the most recent target values for each series, including the latest predictions.
Shape: (n_series, n_steps).
last_ewma_states : np.ndarray
Current EWMA states for each series and feature, used to compute the next step's EWMA values.
Shape: (n_series, 1, n_ewma_features).
Returns
-------
new_features : list of np.ndarray
A list containing arrays of recomputed features (lags, rolling means, and EWMAs), each with
shape (n_series, 1, n_features_of_type).
last_ewma_states : np.ndarray
Updated EWMA states after computing the new features.
"""
# calculate features for next step
new_features = []
for name, spec in self.lagged_idx_args_funcs_.items():
kwargs = {
"X": y_tail,
"num_series": self.num_series,
"inference_mode": True
}
if name == "lag":
kwargs["lags"] = spec["args"]
elif name == "roll":
kwargs["seas_roll_dict"] = spec["args"]
elif name == "ewma":
kwargs["spans"] = spec["args"]
kwargs["last_ewma_states"] = last_ewma_states
new_features.append(spec["func"](**kwargs))
# update last_ewma_states with new states to feed back for next iteration
keys = np.array(list(self.lagged_idx_args_funcs_.keys()))
ewma_pos = np.where(keys == "ewma")[0].item()
last_ewma_states = new_features[ewma_pos]
return new_features, last_ewma_states
def _get_lagged_start_idx_args_funcs(
self, X: pd.DataFrame, lagged_patterns: list[int] = ["lag", "roll", "ewma"],
func_map: dict = {"lag": make_lags, "roll": make_seasonal_roll_means, "ewma": make_ewmas}
) -> tuple[dict, int]:
"""
Collects lagged feature metadata from X:
- indices of columns for each pattern
- static arguments (lags, seasonal roll dicts, spans)
- function reference for recursive updates
Returns a sorted dict + the start index of the lagged feature block.
"""
cols = self.feature_names_in_
# derive column idx and arguments from cols to recursively compute lagged features in predict()
arg_dict = {}
for pattern in lagged_patterns:
lagged_cols = [col for col in cols if pattern in col]
lagged_idx = np.where(np.isin(cols, lagged_cols))[0]
if pattern == "roll":
roll_dict = {}
for col in lagged_cols:
match = re.search(f"_{pattern}(\\d+)(\\w+)", col)
n, period = int(match.group(1)), match.group(2)
roll_dict.setdefault(period, []).append(n)
spec_arg = create_seas_roll_dict(roll_dict)
else:
spec_arg = [int(re.search(f"_{pattern}(\\d+)", col).group(1)) for col in lagged_cols]
arg_dict[pattern] = {"idx": lagged_idx, "args": spec_arg, "func": func_map[pattern]}
# Sort by first column index
sorted_idx_key_tuples = sorted((min(v["idx"]), k) for k, v in arg_dict.items())
order = [k for min_idx, k in sorted_idx_key_tuples]
start_idx = min(min(value["idx"]) for value in arg_dict.values())
last_idx = max(max(value["idx"]) for value in arg_dict.values())
if start_idx != 0 and last_idx != len(cols) - 1:
raise ValueError("Lagged features block must either precede or succeed known features block")
# return sorted arg_dict and the col_idx where the lagged feature block starts
return {key: arg_dict[key] for key in order}, start_idx
def get_feature_names_out(self, input_features=None):
input_features = self.feature_names_in_
return np.asarray(input_features)
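The metadata recovery in _get_lagged_start_idx_args_funcs relies on the naming patterns produced by LaggedFeaturesTransformer.get_feature_names_out. A minimal illustration of how the static arguments are parsed back out of the column names (the column list here is a made-up example):

```python
import re

cols = ["sales_lag1", "sales_lag14", "sales_ewma7", "sales_roll5weeks"]

# recover the lag and span arguments from the feature-name suffixes
lags = [int(re.search(r"_lag(\d+)", col).group(1)) for col in cols if "lag" in col]
spans = [int(re.search(r"_ewma(\d+)", col).group(1)) for col in cols if "ewma" in col]

print(lags)   # [1, 14]
print(spans)  # [7]
```

Because the arguments live in the names, the fitted regressor needs no extra state to know which lags to recompute at prediction time.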
With the recursive regressor defined, we can start building the pipeline for the recursive forecast.
Training and Evaluation ↑¶
We reuse several transformers and components from the non-recursive pipeline and add new ones, such as lagged_features, which appends the defined LaggedFeaturesTransformer to the already established target_pipe. We also fit the pipeline once to retrieve the feature names and extract the indices of the categorical features, which we then provide to the recursive regressor.
from sklearn.preprocessing import OrdinalEncoder
num_groups = num_series_sample
# reusing target_pipe for consistent lagged features with the target
lagged_pipe = make_pipeline(
target_pipe,
LaggedFeaturesTransformer(num_series=num_groups)
)
# defining an ordinal encoder enables forecasting new series via unknown_value
ordinal_encoder = OrdinalEncoder(dtype=int, handle_unknown="use_encoded_value", unknown_value=-1)
pre = ColumnTransformer(
[
("lagged_features", lagged_pipe, ["sales"]),
("cat", ordinal_encoder, cat_cols), # day, month, holiday and basic (store_nbr, family, city, etc.) cols
("fourier", fourier_transformer, fourier_cols),
("reals", to_float32, selector_no_sales),
("dist_clip", clip_transformer, dist_holiday_cols),
],
remainder="passthrough",
verbose_feature_names_out=False,
)
recursive_lgbm = TransformedTargetRegressor(
RecursiveRegressor(
base_estimator=LGBMRegressor(
learning_rate=0.1, # default
num_leaves=31, # default
random_state=42,
force_row_wise=True,
verbosity=0),
        num_series=num_groups
    ),
    transformer=target_pipe,
    check_inverse=False  # essential: avoid subset transform that breaks group-size assumption
)
recursive_lgbm_pipe = Pipeline([
("pre", pre),
("recursive_lgbm", recursive_lgbm)
])
# get the indices of categorical features in the resulting Frame to feed them to recursive_lgbm_pipe.fit()
preproc = recursive_lgbm_pipe[:-1]  # only the preprocessing steps, not the TransformedTargetRegressor
preproc.fit(X, y)
cat_idx = np.where(
np.isin(preproc.get_feature_names_out(), cat_cols)
)[0].tolist()
recursive_lgbm_pipe
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('lagged_features',
Pipeline(steps=[('pipeline',
Pipeline(steps=[('functiontransformer-1',
FunctionTransformer(feature_names_out='one-to-one',
func=<function cast2float32 at 0x3b427fbe0>)),
('functiontransformer-2',
FunctionTransformer(check_inverse=False,
feature_names_out=<funct...
FunctionTransformer(feature_names_out='one-to-one',
func=<function cast2float32 at 0x3b427fbe0>)),
('functiontransformer-2',
FunctionTransformer(check_inverse=False,
feature_names_out=<function log1p_feature_names at 0x36ca09120>,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>)),
('groupstandardscaler',
GroupStandardScaler(inverse_sorted_by_group=True,
num_groups=309))])))])
Now let us have a look at how this recursive model performs in cross-validation:
recursive_lgbm_rmsles = -cross_val_score(recursive_lgbm_pipe,
X,
y,
scoring="neg_root_mean_squared_log_error",
cv=ts_cv,
params={"recursive_lgbm__categorical_feature": cat_idx})
recursive_lgbm_rmsles
array([1.15484401, 0.48698419, 0.56164596, 0.40917731, 0.45350827])
How much better than our simple LightGBM-baseline model is this recursive model in percentage improvement?
cv_improvement(recursive_lgbm_rmsles, lgbm_rmsles)
array([-45.9, 4.2, -1.5, 22.4, -2.2])
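For context, `cv_improvement` was defined earlier in the notebook; assuming it computes the per-fold percentage reduction in RMSLE relative to the baseline (positive meaning the new model is better), a minimal sketch:

```python
import numpy as np

def cv_improvement_sketch(new_scores: np.ndarray, baseline_scores: np.ndarray) -> np.ndarray:
    """Per-fold percentage improvement of new_scores over baseline_scores.
    Positive values mean the new model achieves a lower (better) error."""
    return np.round((baseline_scores - new_scores) / baseline_scores * 100, 1)

cv_improvement_sketch(np.array([0.9, 1.1]), np.array([1.0, 1.0]))  # -> array([ 10., -10.])
```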
We observe improvements over the non-recursive model in folds 2 and 4, but the other folds deteriorate, especially the already poorly performing first fold. This fold spans from late December to mid-January, and the lagged features carry the unusually high values of the Christmas season into the forecast, which degrades the results. We could increase the number of leaves or the depth of the trees to help LightGBM capture the interactions between the post-Christmas folds and the lags more effectively, or add flags that tell LightGBM when to ignore misleading lagged features.
Still, it remains questionable whether this would resolve the first fold's poor values entirely. Thus, we hope that our next model, a Temporal Fusion Transformer (TFT), will capture this regime shift better thanks to its attention mechanism, which should help identify which values in the sequence can be ignored and which should be weighted more strongly.
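One concrete way to implement the flag idea from above: mark rows whose lag window reaches back into the distorted Christmas/New Year period, so the booster can learn to discount lagged values there. This is a hypothetical sketch, not the notebook's actual feature set; the date column, the Dec 20 to Jan 2 window, and the 7-day lookback are all assumptions:

```python
import pandas as pd

def add_christmas_lag_flag(df: pd.DataFrame, date_col: str = "date", lookback: int = 7) -> pd.DataFrame:
    """Flag rows whose short lags (up to `lookback` days back) fall into Dec 20 - Jan 2."""
    dates = pd.to_datetime(df[date_col])
    shifted = dates - pd.Timedelta(days=lookback)

    def in_window(d: pd.Series) -> pd.Series:
        return ((d.dt.month == 12) & (d.dt.day >= 20)) | ((d.dt.month == 1) & (d.dt.day <= 2))

    out = df.copy()
    # checking both interval endpoints suffices here because the window (14 days) is longer than the lookback
    out["lags_cover_christmas"] = (in_window(dates) | in_window(shifted)).astype(int)
    return out

add_christmas_lag_flag(pd.DataFrame({"date": ["2016-01-05", "2016-07-15"]}))["lags_cover_christmas"].tolist()
# -> [1, 0]
```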
Next, we want to assess how well the model generalizes, so we examine how it performs on the training data after having been fitted to that same data:
def in_sample_metrics(X, y, pipeline, ts_cv) -> np.ndarray:
    """Fits the pipeline per fold and returns the RMSLE on the same training data for each fold."""
    in_sample_rmsles = []
    for fold in ts_cv:
        tr, val = fold
        pipe = clone(pipeline)
        pipe.fit(X.iloc[tr], y.iloc[tr])
        # switch to one-step-ahead prediction: all true lags are available in-sample
        pipe[-1].regressor_.use_recursive = False
        y_pred = pipe.predict(X.iloc[tr])
        in_sample_rmsles.append(
            get_metrics(y.iloc[tr], y_pred, print_logs=False)["RMSLE"]
        )
    return np.array(in_sample_rmsles)
in_sample_metrics(X, y, recursive_lgbm_pipe, ts_cv)
array([0.3823, 0.3842, 0.3855, 0.3867, 0.3858])
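Using the two result arrays printed above, the per-fold generalization gap (out-of-sample minus in-sample RMSLE) can be computed directly:

```python
import numpy as np

# values copied from the two outputs above
out_of_sample = np.array([1.15484401, 0.48698419, 0.56164596, 0.40917731, 0.45350827])
in_sample = np.array([0.3823, 0.3842, 0.3855, 0.3867, 0.3858])
gap = np.round(out_of_sample - in_sample, 3)
print(gap)  # the first fold accounts for by far the largest part of the gap
```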
Apart from the first fold, the in-sample values are slightly lower but still quite similar to the out-of-sample results. However, this may not be an entirely fair comparison, since the out-of-sample predictions could also suffer from error accumulation due to the recursive nature of the model. To understand this better, without changing the pipeline architecture, we analyze the metric distributions for each forecast step (1 to 14 days ahead):
def out_of_sample_per_distance_metrics(X, y, pipeline, ts_cv, n_series) -> np.ndarray:
    """Separately calculates the RMSLE value for each time step ahead and returns them as an array."""
    fold_rmsles = []
    for fold in ts_cv:
        tr, val = fold
        pipe = clone(pipeline)
        pipe.fit(X.iloc[tr], y.iloc[tr])
        y_pred = pipe.predict(X.iloc[val])
        # rows: series, columns: forecast steps 1..horizon
        y_pred_re = y_pred.reshape(n_series, -1)
        y_true_re = y.iloc[val].to_numpy().reshape(n_series, -1)
        # index column i to get all series' values for step i+1 ahead
        metrics_per_distance = [get_metrics(y_true_re[:, i], y_pred_re[:, i], print_logs=False)["RMSLE"]
                                for i in range(y_true_re.shape[1])]
        fold_rmsles.append(metrics_per_distance)
    return np.array(fold_rmsles)
metrics_per_distance = out_of_sample_per_distance_metrics(X, y, recursive_lgbm_pipe, ts_cv, num_series_sample)
metrics_per_distance_df = (
pd.DataFrame(metrics_per_distance, columns=range(1, 15))
.assign(Fold=range(1, 6))
.melt(id_vars="Fold", var_name="Step", value_name="RMSLE")
)
plt.figure(figsize=(10, 4))
sns.boxplot(metrics_per_distance_df, x="Step", y="RMSLE");
plt.title("Distribution Across CV-Folds for Each Prediction Step");
plt.xlabel("Steps Ahead");
The distribution of RMSLE values does not show a consistent upward trend with increasing forecast steps. This indicates that error accumulation in the recursive forecasts is weak, and the higher out-of-sample RMSLE values primarily reflect normal generalization differences rather than instability in the recursive prediction process.
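This conclusion can be made quantitative by fitting a line to the mean RMSLE per forecast step and inspecting its slope; a near-zero slope confirms weak error accumulation. A sketch with illustrative numbers, not the actual fold metrics:

```python
import numpy as np

steps = np.arange(1, 15)
# illustrative per-step mean RMSLEs, roughly flat - stand-ins for metrics_per_distance.mean(axis=0)
mean_rmsle = np.array([0.52, 0.51, 0.53, 0.52, 0.54, 0.53, 0.52,
                       0.53, 0.54, 0.53, 0.52, 0.53, 0.54, 0.53])
slope, intercept = np.polyfit(steps, mean_rmsle, deg=1)
print(f"RMSLE grows by ~{slope:.4f} per additional step ahead")  # near zero -> weak accumulation
```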
Reduce Features for TFT ↑¶
Since training a Temporal Fusion Transformer is resource-intensive, we want to reduce the feature set to only relevant features. While our recursive LightGBM is, of course, from a completely different model class, we use its feature importances to identify features that are negligible. Although these importances are model-specific, they still provide a useful first-order signal for filtering clearly irrelevant features before moving to the more expensive TFT setup. Therefore, we inspect the feature importances next.
def get_feature_importance(fitted_lgbm=recursive_lgbm.regressor_.est_) -> pd.DataFrame:
    # use the public booster_ attribute of the fitted LGBMRegressor
    imp_split = fitted_lgbm.booster_.feature_importance("split")
    imp_gain = fitted_lgbm.booster_.feature_importance("gain")
    feat_names = fitted_lgbm.booster_.feature_name()
imp_df = (pd.DataFrame({
"Feature": feat_names,
"Split": imp_split.astype(int), # counts → int
"Gain": imp_gain.astype(float) # gain → float
})
.sort_values("Gain", ascending=False)
.reset_index(drop=True)
.melt(id_vars="Feature", var_name="Type", value_name="Value") # stack the split and gain values
.assign(Percent=lambda d: d["Value"] / d.groupby("Type")["Value"].transform("sum") * 100,
Log_Value = lambda d: d["Value"].transform(np.log1p))
)
return imp_df
imp_features = get_feature_importance()
def plot_feature_importance(df, x="Percent"):
    g = sns.catplot(df, y="Feature", x=x, col="Type", kind="bar", col_order=["Gain", "Split"],
                    sharex=False, height=9, aspect=0.7)
    titles = ["Gain (Loss Reduction)", "Number of Splits"]
    for ax, title in zip(g.axes.flatten(), titles):
        ax.set_title(title)
        ax.set_xlabel(None)
    suptitle_stem = "Gain and Number of Splits by Features"
    plt.suptitle(f"Percentage of {suptitle_stem}" if x == "Percent" else
                 f"Logarithm of {suptitle_stem}" if x == "Log_Value" else
                 suptitle_stem,
                 size=14, y=1.04);
plot_feature_importance(imp_features)
As is often the case, only a handful of features contain most of the relevant information for the model. Many of these are newly added lagged features, especially those capturing weekly and, to a lesser extent, yearly seasonality patterns. The day of year, the product family, and the store number are most frequently used in splits. The model also makes use of several holiday-based features, such as the New Year flag, the distance to the next national holiday and its type, and the distance to Christmas.
We now keep only features that contribute 0.5% or more to the gain or to the number of splits to reduce noise and mitigate overfitting. Since the distance in days to the Manabí earthquake helps the model capture the source of some extremely high values, we also include it as a feature, along with the distance to Good Friday, as we saw that sales of alcoholic beverages increase during carnival, which is also linked to Good Friday. We also keep all membership scores to the series shape clusters as features, since three of them contribute more than 0.5%, but they are only meaningful as a complete block.
from collections.abc import Iterable
# remove lagged features (columns containing "sales" in their name), add sales for calculating lagged features in pipeline
relevant_feat = (imp_features
.loc[imp_features.Percent>0.5, "Feature"]
.drop_duplicates()
.sort_values()
.values)
# remove clipped suffix from dist features (to still match the clipping dictionary in the next pipeline fit)
relevant_feat = [feat.removesuffix("_clipped") if feat.endswith("_clipped") else feat for feat in relevant_feat]
# create list of features that are important
feature_cols_rel = sorted( # for better readability
list(
set(feat for feat in relevant_feat if not any(word in feat for word in ["sales", "sin", "cos"])) # using sets for uniqueness
        .union({"sales", "dist_national_holiday", "dist_terremoto_manabi", "dist_viernes_santo"}) # add sales and the distance features that help explain outliers
.union(set(col for col in X.columns if col.startswith("membership"))) # ensure that all membership features are used
)
) + fourier_cols # only make sense as a set, too
# updating the cat_cols
cat_cols_rel = list(
set(feature_cols_rel).intersection(set(cat_cols))
)
cat_cols_rel
# only distance features that are important
dist_cols_rel = list(
set(feature_cols_rel).intersection(set(dist_holiday_cols))
)
# extract argument names of lagged functions from relevant_feat
def get_lagged_args(cols: Iterable[str], lagged_patterns: list[str] = ["lag", "roll", "ewma"]) -> dict:
arg_dict = {}
for pattern in lagged_patterns:
lagged_cols = [col for col in cols if pattern in col]
if pattern == "roll":
roll_dict = {}
for col in lagged_cols:
match = re.search(f"_{pattern}(\\d+)(\\w+)", col)
n, period = int(match.group(1)), match.group(2)
roll_dict.setdefault(period, []).append(n)
spec_arg = create_seas_roll_dict(roll_dict)
else:
spec_arg = [int(re.search(f"_{pattern}(\\d+)", col).group(1)) for col in lagged_cols]
arg_dict[pattern] = spec_arg
return arg_dict
rel_lagged_args = get_lagged_args(relevant_feat)
print(f"Relevant Feature Columns:\n{feature_cols_rel}\n\n"
      f"Relevant Arguments for Lagged Transformer:\n{rel_lagged_args}")
Relevant Feature Columns:
['day', 'days_elapsed', 'dist_national_holiday', 'dist_terremoto_manabi', 'dist_viernes_santo', 'family', 'is_new_year', 'membership_cluster0_85', 'membership_cluster1_85', 'membership_cluster2_85', 'membership_cluster3_85', 'membership_cluster4_85', 'membership_cluster5_85', 'month', 'national_type', 'onpromotion', 'sales', 'store_nbr', 'weekday', 'day_of_year']
Relevant Arguments for Lagged Transformer:
{'lag': [1, 14, 2, 21, 28, 3, 365, 366, 4, 5, 6, 7, 8, 9], 'roll': {'roll3years': [365, 730, 1095], 'roll5weeks': [7, 14, 21, 28, 35]}, 'ewma': [14, 30, 365, 7]}
Now that the variables and function arguments are set to fit a model using only the relevant features, we clone the previous pipeline and adjust only the parts that need to change.
X_rel = X[feature_cols_rel]
rel_recursive_pipe = clone(recursive_lgbm_pipe)
# use new category cols
pre = rel_recursive_pipe.get_params()["pre"]
for i, (name, trans, cols) in enumerate(pre.transformers):
if name == "cat":
pre.transformers[i] = (name, trans, cat_cols_rel)
if name == "dist_clip":
pre.transformers[i] = (name, trans, dist_cols_rel)
rel_recursive_pipe.set_params(
pre=pre,
pre__lagged_features__laggedfeaturestransformer__lags=rel_lagged_args["lag"],
pre__lagged_features__laggedfeaturestransformer__seasonal_roll_dict=rel_lagged_args["roll"],
pre__lagged_features__laggedfeaturestransformer__ewma_spans=rel_lagged_args["ewma"]
)
preproc = rel_recursive_pipe[:-1]  # only the preprocessing steps, not the TransformedTargetRegressor
preproc.fit(X_rel, y)
def get_cat_idx(preprocessor, cat_cols):
feature_names = preprocessor.get_feature_names_out()
return np.where(np.isin(feature_names, cat_cols))[0].tolist()
cat_idx_rel = get_cat_idx(preproc, cat_cols_rel)
rel_recursive_pipe
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('lagged_features',
Pipeline(steps=[('pipeline',
Pipeline(steps=[('functiontransformer-1',
FunctionTransformer(feature_names_out='one-to-one',
func=<function cast2float32 at 0x3b427fbe0>)),
('functiontransformer-2',
FunctionTransformer(check_inverse=False,
feature_names_out=<funct...
FunctionTransformer(feature_names_out='one-to-one',
func=<function cast2float32 at 0x3b427fbe0>)),
('functiontransformer-2',
FunctionTransformer(check_inverse=False,
feature_names_out=<function log1p_feature_names at 0x36ca09120>,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>)),
('groupstandardscaler',
GroupStandardScaler(inverse_sorted_by_group=True,
num_groups=309))])))])
n_recursive_features = recursive_lgbm_pipe[:-1].get_feature_names_out().size
n_trimmed_recursive_features = rel_recursive_pipe[:-1].get_feature_names_out().size
print(f"Number of recursive features: {n_recursive_features}\n"
f"Number of trimmed recursive features: {n_trimmed_recursive_features}")
Number of recursive features: 64
Number of trimmed recursive features: 47
We reduced the number of features by roughly a quarter (from 64 to 47). Now it is time to cross-validate the sparser model:
# relevant features (0.5%) of original features (ewma 14 instead of 8) + all memberships + dist_terremoto_manabi + dist_viernes_santo (-56, 14)
rel_recursive_rmsles = -cross_val_score(rel_recursive_pipe,
X_rel,
y,
scoring="neg_root_mean_squared_log_error", cv=ts_cv,
params={"recursive_lgbm__categorical_feature": cat_idx_rel})
rel_recursive_rmsles
array([1.16184363, 0.4837483 , 0.5608136 , 0.41068487, 0.45299077])
rel_cv_improvements = cv_improvement(rel_recursive_rmsles, recursive_lgbm_rmsles)
rel_cv_improvements
array([-0.6, 0.7, 0.1, -0.4, 0.1])
The values stayed roughly the same. Thus, we successfully reduced the dimensionality without harming the model's predictive power, which confirms that the removed features were indeed negligible.
Temporal Fusion Transformer ↑¶
The Temporal Fusion Transformer (TFT) is a transformer-based forecasting model designed for multi-horizon time series with mixed static and time-varying features. TFT can dynamically select relevant covariates over time and leverage both known and unknown future inputs. Furthermore, as a neural network, it natively performs multi-horizon forecasts, making it well suited for our dataset.
I use the implementation from PyTorch Forecasting, which is built on PyTorch Lightning, since it also handles non-trivial and tedious tasks such as sequence generation for multiple time series and model construction (I built a custom model and trainer class with base PyTorch in my MBTI project↗).
TFT also provides interpretable variable importances and attention over temporal patterns, which helps to understand the drivers of the forecast in case the TFT beats the strong LightGBM baselines. But first, we have to define a dataset for training the TFT with a slightly modified sklearn preprocessing pipeline.
cat_cols_tft = ["day", "month", "is_new_year", "is_leap_year", "national_type", "store_nbr", "family"]
fourier_cols_tft = ["weekday", "day_of_year"]
unstandardized_cols = ["onpromotion"]
dist_holiday_cols_tft = ["dist_national_holiday", "dist_terremoto_manabi", "dist_viernes_santo"]
membership_cols = [f"membership_cluster{i}_85" for i in range(6)] # membership scores of shape clusters
cols_tft = np.array(
cat_cols_tft
+ fourier_cols_tft
+ unstandardized_cols
+ dist_holiday_cols_tft
+ membership_cols
+ ["days_elapsed", "sales"]
)
cols_tft
array(['day', 'month', 'is_new_year', 'is_leap_year', 'national_type',
'store_nbr', 'family', 'weekday', 'day_of_year', 'onpromotion',
'dist_national_holiday', 'dist_terremoto_manabi',
'dist_viernes_santo', 'membership_cluster0_85',
'membership_cluster1_85', 'membership_cluster2_85',
'membership_cluster3_85', 'membership_cluster4_85',
'membership_cluster5_85', 'days_elapsed', 'sales'], dtype='<U22')
Since training is resource-intensive, we only use the last cross-validation fold for training and evaluation.
from sklearn.preprocessing import MinMaxScaler
X_tft = X[cols_tft] # contains "sales"!
X_train_tft = X_tft.iloc[ts_cv[-1][0]]
y_train_tft = y.iloc[ts_cv[-1][0]]
y_val_tft = y.iloc[ts_cv[-1][1]]
# y keeps just y, no changes made
num_groups = num_series_sample
# lag 365 lies outside the encoder sequence length. Including lags 7 & 28 explicitly often helps.
# lag 366 is not necessary, as it is already covered within the sequence of lag 365 for all but the
# last day in the sequence (the information is extractable via attention and the is_leap_year flag)
lags_tft = [7, 28, 365]
# new instance of LaggedFeaturesTransformer with fewer lags
lagged_features_tft = LaggedFeaturesTransformer(
num_series=num_series_sample,
lags=lags_tft,
seasonal_roll_dict=seas_roll_dict,
ewma_spans=spans
)
# new pipeline for fewer lags
lagged_pipe = make_pipeline(
target_pipe, # log1p transformation + groupwise scaling
lagged_features_tft
)
# neural nets require scaled features
stand_pipe = make_pipeline(
to_float32,
MinMaxScaler()
)
# distances to holiday must be clipped and standardized
dist_pipe = make_pipeline(
to_float32,
clip_transformer,
MinMaxScaler()
)
def cast2string(X: pd.DataFrame) -> pd.DataFrame:
return X.astype(str)
to_string = FunctionTransformer(cast2string, feature_names_out="one-to-one")
pipe_tft = ColumnTransformer(
[
("lagged_features", lagged_pipe, ["sales"]),
("cat", to_string, cat_cols_tft), # day, month, holiday and basic (store_nbr, family, city, etc.) cols
("fourier", fourier_transformer, fourier_cols_tft),
("float32", to_float32, membership_cols + ["sales"]), # already normalized to 0-1
("standard", stand_pipe, unstandardized_cols),
("dist", dist_pipe, dist_holiday_cols_tft)
],
remainder="passthrough",
verbose_feature_names_out=False,
)
pipe_tft.fit_transform(X_train_tft)
df_tft = pipe_tft.transform(X_tft)
df_tft
| sales_log1p_groupscaled_lag7 | sales_log1p_groupscaled_lag28 | sales_log1p_groupscaled_lag365 | sales_log1p_groupscaled_roll5weeks | sales_log1p_groupscaled_roll3months | sales_log1p_groupscaled_roll3years | sales_log1p_groupscaled_ewma7 | sales_log1p_groupscaled_ewma14 | sales_log1p_groupscaled_ewma30 | sales_log1p_groupscaled_ewma365 | ... | membership_cluster2_85 | membership_cluster3_85 | membership_cluster4_85 | membership_cluster5_85 | sales | onpromotion | dist_national_holiday_clipped | dist_terremoto_manabi_clipped | dist_viernes_santo_clipped | days_elapsed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6752 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.067289 | -0.035887 | -0.017365 | -0.001471 | ... | 3.078225e-14 | 2.072993e-07 | 8.377087e-25 | 4.430785e-11 | 0.000 | 0.000000 | 0.500000 | 0.0 | 0.0 | 305 |
| 6753 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.067289 | -0.035887 | -0.017365 | -0.001471 | ... | 3.078225e-14 | 2.072993e-07 | 8.377087e-25 | 4.430785e-11 | 0.000 | 0.000000 | 0.571429 | 0.0 | 0.0 | 306 |
| 6754 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.117755 | -0.066990 | -0.033609 | -0.002934 | ... | 3.078225e-14 | 2.072993e-07 | 8.377087e-25 | 4.430785e-11 | 0.000 | 0.000000 | 0.642857 | 0.0 | 0.0 | 307 |
| 6755 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.155605 | -0.093945 | -0.048806 | -0.004388 | ... | 3.078225e-14 | 2.072993e-07 | 8.377087e-25 | 4.430785e-11 | 0.000 | 0.000000 | 0.714286 | 0.0 | 0.0 | 308 |
| 6756 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.269154 | -0.183992 | -0.117306 | -0.063022 | -0.005835 | ... | 3.078225e-14 | 2.072993e-07 | 8.377087e-25 | 4.430785e-11 | 0.000 | 0.000000 | 0.785714 | 0.0 | 0.0 | 309 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3007997 | 0.317381 | 0.690590 | -0.647839 | 0.214344 | 0.518538 | -0.619026 | -0.023681 | 0.062262 | 0.121383 | 0.132693 | ... | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 11.000 | 0.000000 | 0.000000 | 1.0 | 1.0 | 1974 |
| 3007998 | 1.200722 | 1.303128 | -0.124038 | 0.823626 | 0.085624 | -0.586271 | -0.108160 | 0.005747 | 0.090223 | 0.129992 | ... | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 21.916 | 0.005398 | 0.000000 | 1.0 | 1.0 | 1975 |
| 3007999 | 0.349699 | 1.317503 | 1.254984 | 0.620278 | 0.996349 | 0.114409 | 0.082678 | 0.092340 | 0.126673 | 0.132862 | ... | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 19.909 | 0.000000 | 0.000000 | 1.0 | 1.0 | 1976 |
| 3008000 | 0.191379 | 0.304661 | 1.867592 | -0.114489 | 1.056888 | 0.996320 | 0.189791 | 0.148179 | 0.151477 | 0.134929 | ... | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 12.000 | 0.000000 | 0.000000 | 1.0 | 1.0 | 1977 |
| 3008001 | 0.308342 | -1.425819 | 0.099764 | -0.514868 | 0.384421 | 0.392825 | 0.083395 | 0.096983 | 0.126491 | 0.132903 | ... | 1.172952e-07 | 3.513111e-01 | 1.142362e-03 | 5.823184e-09 | 19.316 | 0.000000 | 0.000000 | 1.0 | 1.0 | 1978 |
517266 rows × 39 columns
Based on this preprocessed dataset, we can now create training and validation datasets with TimeSeriesDataSet, from which we also derive the corresponding data loaders.
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data.encoders import GroupNormalizer
max_prediction_length=14
max_encoder_length=90
lagged_features_tft = [col for col in df_tft if col.startswith("sales_log1p_groupscaled_")]
# distinction between known and unknown lagged features:
# all features using only lags >= horizon:
known_lagged_features = [feature for feature in lagged_features_tft
if re.search("lag28|lag365|roll3months|roll3years", feature)]
# all features using at least one lag < horizon
unknown_lagged_features = list(set(lagged_features_tft) - set(known_lagged_features)) # all ewmas, lag 7 and roll5weeks
harmonics_tft = [col for col in df_tft if re.match("weekday|day_of_year", col)] # fourier harmonics
time_varying_known_reals = lagged_features_tft + harmonics_tft + ["days_elapsed", "onpromotion"]
# define the datasets
training = TimeSeriesDataSet(
df_tft.iloc[ts_cv[-1][0]], # only last fold's training set
time_idx="days_elapsed",
target="sales",
group_ids=["store_nbr", "family"],
max_encoder_length=max_encoder_length,
max_prediction_length=max_prediction_length,
static_categoricals=["store_nbr", "family"],
static_reals=membership_cols,
time_varying_known_categoricals=["day", "month", "is_new_year", "is_leap_year", "national_type"],
time_varying_known_reals=known_lagged_features + harmonics_tft + ["days_elapsed", "onpromotion"],
time_varying_unknown_reals=unknown_lagged_features + ["sales"],
target_normalizer=GroupNormalizer(groups=["store_nbr", "family"], transformation="log1p"),
add_relative_time_idx=True,
add_target_scales=True,
add_encoder_length=True
)
validation = TimeSeriesDataSet.from_dataset(
training, df_tft, predict=True, stop_randomization=True
)
# move data to dataloaders
batch_size = 64
n_workers = 0 # >0 can lead to multiprocessing issues
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=n_workers,
                                          persistent_workers=False) # can cause kernel crashes on M1 Macs
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 2, num_workers=n_workers,
persistent_workers=False)
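To make the effect of the target_normalizer used above concrete, here is a minimal numpy sketch of what per-group log1p standardization does. This is a toy illustration under my own assumptions, not pytorch-forecasting's actual GroupNormalizer code, and group_log1p_scale is a hypothetical helper:

```python
import numpy as np

# Toy illustration (NOT pytorch-forecasting internals) of what
# GroupNormalizer(groups=[...], transformation="log1p") does conceptually:
# each (store_nbr, family) series is standardized on the log1p scale, so
# groups with very different sales volumes contribute comparable targets.
def group_log1p_scale(sales, group_ids):
    sales = np.asarray(sales, dtype=float)
    group_ids = np.asarray(group_ids)
    out = np.empty_like(sales)
    for g in np.unique(group_ids):
        mask = group_ids == g
        logged = np.log1p(sales[mask])
        std = logged.std() or 1.0  # guard against constant series
        out[mask] = (logged - logged.mean()) / std
    return out

# two groups whose sales differ by an order of magnitude end up on one scale
scaled = group_log1p_scale([0, 10, 100, 1000], ["a", "a", "b", "b"])
```

With two observations per group, each group maps exactly to (-1, 1) regardless of its absolute sales level, which is the point of normalizing per group.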
Before defining the TFT model for training, we create a dummy model from the dataset and use it to find a good learning rate.
import lightning.pytorch as pl
from lightning.pytorch.tuner import Tuner
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import RMSE, MAE
import torch.nn as nn
import warnings
# suppress the warnings about num_workers; num_workers=0 is the safest setting for these dataloaders
warnings.filterwarnings("ignore")
# configure network and trainer
pl.seed_everything(14)
trainer = pl.Trainer(
accelerator="mps",
gradient_clip_val=0.1,
max_epochs=2
)
tft = TemporalFusionTransformer.from_dataset(
training,
learning_rate=1e-3,
hidden_size=8,
attention_head_size=2,
dropout=0.1,
hidden_continuous_size=8, # set to <= hidden_size
loss=RMSE(),
logging_metrics=nn.ModuleList([RMSE(), MAE()]),
optimizer="adamw",
)
print(f"Number of parameters in network: {tft.size() / 1e3:.1f}k")
res = Tuner(trainer).lr_find(
tft,
train_dataloaders=train_dataloader,
val_dataloaders=val_dataloader,
max_lr=1,
min_lr=1e-6,
)
print(f"suggested learning rate: {res.suggestion()}")
fig = res.plot(show=True, suggest=True)
Number of parameters in network: 26.2k
suggested learning rate: 6.025595860743577e-06
The LR range test curve looks slightly unreliable, with similar loss values for learning rates roughly between $5*10^{-6}$ and $10^{0}$. Within this region the loss changes only moderately, indicating that the model is not very sensitive to the exact learning rate choice.
I therefore chose a conservative learning rate of $1*10^{-3}$, which lies well within this stable plateau. In practice, this learning rate trained stably without divergence and provided good convergence speed, so I used it for all TFT experiments.
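For intuition, lr_find's suggestion follows a steepest-descent heuristic on the recorded loss curve (an assumption about Lightning's internals on my part, not its exact code), which is precisely what makes it fragile on a long flat plateau. A minimal sketch with the hypothetical helper suggest_lr:

```python
import numpy as np

# Sketch of the steepest-descent heuristic behind an LR range test suggestion:
# sweep the learning rate on a log grid, record the loss, and pick the LR where
# the loss curve drops fastest. suggest_lr is an illustrative helper, not the
# Lightning API.
def suggest_lr(lrs, losses):
    grads = np.gradient(np.asarray(losses, dtype=float))
    return float(np.asarray(lrs)[np.argmin(grads)])  # most negative slope

# toy loss curve: a smooth sigmoid drop whose steepest point sits at lr = 1e-3
lrs = np.logspace(-6, 0, 61)
x = np.linspace(0.0, 1.0, 61)
losses = 1.0 / (1.0 + np.exp(-20.0 * (0.5 - x)))
```

On this synthetic curve the heuristic recovers the steepest point; on a near-flat plateau like the one observed here, tiny noise can move the argmin anywhere within the plateau, which is why picking a value well inside the stable region by hand is reasonable.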
I also chose a small hidden size of only 8 neurons and just 2 attention heads to reduce model complexity, together with a relatively high dropout rate of 0.3, since in initial experiments the model tended to overfit right from the start. We now define the actual model for training:
from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor, TQDMProgressBar
from lightning.pytorch.loggers import TensorBoardLogger
import lightning.pytorch as pl
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import RMSE, MAE
import torch.nn as nn
from lightning.pytorch.callbacks import ModelCheckpoint
# configure network and trainer
early_stop_callback = EarlyStopping(
monitor="val_loss",
min_delta=1e-4,
patience=10,
mode="min"
)
lr_logger = LearningRateMonitor() # log the learning rate
logger = TensorBoardLogger(save_dir=".", # ensures project dir, avoids creating a nested subdir "lightning_logs"
name="lightning_logs")
# save best model in checkpoint
checkpoint_callback = ModelCheckpoint(
dirpath=None, # to force the checkpoint to be on the top directory
monitor="val_loss",
mode="min",
save_top_k=1,
)
# configure network and trainer
pl.seed_everything(14)
trainer = pl.Trainer(
accelerator="mps",
gradient_clip_val=1.0,
enable_model_summary=True,
max_epochs=40,
callbacks=[checkpoint_callback, early_stop_callback, lr_logger, TQDMProgressBar(refresh_rate=10)],
logger=logger,
)
print("log_dir:", trainer.logger.log_dir)
tft = TemporalFusionTransformer.from_dataset(
training,
learning_rate=1e-3,
hidden_size=8,
attention_head_size=2,
dropout=0.3,
hidden_continuous_size=8, # set to <= hidden_size
loss=MAE(), # RMSE can make training more unstable
logging_metrics=nn.ModuleList([RMSE(), MAE()]),
optimizer="adamw",
reduce_on_plateau_patience=3,
reduce_on_plateau_min_lr=1e-6,
)
print(f"Number of parameters in network: {tft.size() / 1e3:.1f}k")
log_dir: ./lightning_logs/version_70
Number of parameters in network: 26.2k
trainer.fit(
tft,
train_dataloaders=train_dataloader,
val_dataloaders=val_dataloader,
)
Lightning model summary:

| # | Name | Type | Params |
|---|---|---|---|
| 0 | loss | MAE | 0 |
| 1 | logging_metrics | ModuleList | 0 |
| 2 | input_embeddings | MultiEmbedding | 1.0 K |
| 3 | prescalers | ModuleDict | 528 |
| 4 | static_variable_selection | VariableSelectionNetwork | 4.0 K |
| 5 | encoder_variable_selection | VariableSelectionNetwork | 10.3 K |
| 6 | decoder_variable_selection | VariableSelectionNetwork | 7.5 K |
| 7 | static_context_variable_selection | GatedResidualNetwork | 304 |
| 8 | static_context_initial_hidden_lstm | GatedResidualNetwork | 304 |
| 9 | static_context_initial_cell_lstm | GatedResidualNetwork | 304 |
| 10 | static_context_enrichment | GatedResidualNetwork | 304 |
| 11 | lstm_encoder | LSTM | 576 |
| 12 | lstm_decoder | LSTM | 576 |
| 13 | post_lstm_gate_encoder | GatedLinearUnit | 144 |
| 14 | post_lstm_add_norm_encoder | AddNorm | 16 |
| 15 | static_enrichment | GatedResidualNetwork | 368 |
| 16 | multihead_attn | InterpretableMultiHeadAttention | 212 |
| 17 | post_attn_gate_norm | GateAddNorm | 160 |
| 18 | pos_wise_ff | GatedResidualNetwork | 304 |
| 19 | pre_output_gate_norm | GateAddNorm | 160 |
| 20 | output_layer | Linear | 9 |

26.2 K trainable params, 0 non-trainable, 0.105 MB estimated model size; 739 modules in train mode, 0 in eval mode.
Condensed training log (each epoch covers 7517 batches at roughly 4.5 it/s, i.e. about 28 minutes; the progress-bar postfix at epoch N reports the previous epoch's end-of-epoch losses, so the values below are re-aligned to the epoch they belong to):

| Epoch | train_loss_epoch | val_loss |
|---|---|---|
| 0 | 69.0 | 84.1 |
| 1 | 56.5 | 80.8 |
| 2 | 53.4 | 84.0 |
| 3 | 51.5 | 77.9 |
| 4 | 50.0 | 81.5 |
| 5 | 48.9 | 78.5 |
| 6 | 47.9 | 80.5 |
| 7 | 47.1 | 79.9 |
| 8 | 45.6 | 78.7 |
| 9 | 45.3 | 80.1 |
| 10 | 45.0 | 77.9 |
| 11 | 44.8 | 78.5 |
| 12 | 44.6 | 76.8 |
| 13 | 44.3 | 81.2 |
| 14 | 44.2 | 79.5 |
| 15 | 44.0 | 75.7 |
| 16 | 43.9 | 79.5 |
| 17 | 43.7 | 76.7 |
| 18 | 43.6 | 79.4 |
| 19 | 43.5 | 75.8 |
| 20 | 42.9 | 78.0 |

The run was interrupted at 39% of epoch 21.
Training stopped due to a local kernel crash during epoch 21 (most likely caused by exhausted MPS memory, a known issue on Apple-silicon MacBooks). Since the best model checkpoint was saved automatically (ModelCheckpoint with monitor="val_loss"), we use that checkpoint for evaluation.
Now let us have a look at the performance of this model on the validation set:
from pytorch_forecasting import TemporalFusionTransformer
# Loading model
best_path = './lightning_logs/version_70/checkpoints/epoch=15-step=120272.ckpt'
model = TemporalFusionTransformer.load_from_checkpoint(best_path)
predictions = model.predict(
val_dataloader, return_y=True, trainer_kwargs=dict(accelerator="cpu")
)
output, y_tft = [t.ravel().numpy() for t in [predictions.output, predictions.y[0]]]
get_metrics(y_tft, output)
# note: predictions.y[0] holds the decoder targets, i.e. the actual sales over the prediction window
{'RMSLE': 0.6298, 'MAE': 78.472}
Even with perfect training it is very unlikely that the model could beat the last-14-days baseline of 0.56, not to mention the even lower values for the simple (0.44) and recursive (0.45) LightGBM model. Additionally, despite the reduced complexity, the model still overfits very early, while converging nicely on the training data:
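For reference, get_metrics is presumably defined earlier in the notebook; the two numbers it reports correspond to the usual RMSLE and MAE definitions, which can be sketched in plain numpy (my re-statement, which may differ from the notebook's exact helper):

```python
import numpy as np

# Plain-numpy versions of the two reported metrics (illustrative re-statement
# of the earlier get_metrics helper, not guaranteed to match it in detail).
def rmsle(y_true, y_pred):
    """Root mean squared log error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error on the original sales scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))
```

RMSLE is scale-free, so errors on low-volume series weigh as much as errors on high-volume ones; this is why the MAE of 78.5 looks large while the RMSLE of 0.63 is directly comparable across models and baselines.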
from tensorboard.backend.event_processing import event_accumulator
def print_tb_losses(event_path: str):
ea = event_accumulator.EventAccumulator(event_path)
ea.Reload()
train_loss = ea.Scalars("train_loss_epoch")
val_loss = ea.Scalars("val_loss")
metrics = {"Training": train_loss, "Validation": val_loss}
df = pd.DataFrame(
[{"Epoch": num, "Step": e.step, "wall_time": e.wall_time, "Value": e.value, "Set": dataset}
for dataset, metric in metrics.items() for num, e in enumerate(metric)]
)
plt.figure(figsize=(10, 4))
sns.lineplot(df, x="Epoch", y="Value", hue="Set");
best_epoch = (
df.loc[
df.Value == df[df.Set == "Validation"].Value.min(),
"Epoch"
]
.values
.item()
)
y_min = df.Value.min() - 10
y_max = df.Value.max() + 10
plt.vlines(x=best_epoch, ymin=y_min, ymax=y_max, color="grey");
plt.annotate("Best epoch", xy=(best_epoch + 0.1, y_max - 10), color="grey")
x_tick_labels = range(0, df.Epoch.max() + 1, 2)
plt.xticks(ticks=x_tick_labels, labels=x_tick_labels);
plt.xlim(-0.5, None);
plt.ylim(y_min, y_max);
print_tb_losses("lightning_logs/version_70/events.out.tfevents.1768242820.Moritzs-MBP.2009.1")
We could try to reduce the complexity even further, but it is very unlikely that the TFT will beat the LightGBM models. It might perform better on the full dataset, since as a neural network it benefits more from richer training data than tree-based models, but this remains unclear and would require even more training time.
On noisy datasets with many covariates like this one, tree-based boosting models often outperform deep learning approaches. They are particularly effective at capturing complex conditional interactions and handle noise well by learning stable, piecewise decision rules across regimes. The TFT, in contrast, performs best under greater temporal stability; its variable-selection and attention mechanisms can become unstable and over-responsive to noise in highly non-stationary settings.
Therefore, as shown in the following table, the simple LightGBM model is the most reliable performer so far:
| Predictor | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean | SD |
|---|---|---|---|---|---|---|---|
| Last-14-Days Baseline | 1.9082 | 0.6341 | 0.7302 | 0.4984 | 0.5607 | 0.8663 | 0.5266 |
| Last-Year Baseline | 0.6776 | 0.7373 | 0.8180 | 0.9236 | 0.9726 | 0.8258 | 0.1104 |
| Simple LGBM | 0.7913 | 0.5082 | 0.5533 | 0.5274 | 0.4438 | 0.5648 | 0.1189 |
| Recursive LGBM | 1.1548 | 0.4870 | 0.5616 | 0.4092 | 0.4535 | 0.6132 | 0.2754 |
| TFT | – | – | – | – | 0.6298 | – | – |
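The Mean and SD columns can be reproduced from the per-fold scores. Note that the SD is the population standard deviation (numpy's default, ddof=0); the recursive model's SD matches up to rounding of the fold scores:

```python
import numpy as np

# Recompute the summary columns of the CV table from the per-fold RMSLEs.
fold_rmsles = {
    "Simple LGBM":    [0.7913, 0.5082, 0.5533, 0.5274, 0.4438],
    "Recursive LGBM": [1.1548, 0.4870, 0.5616, 0.4092, 0.4535],
}
summary = {
    name: (round(float(np.mean(v)), 4), round(float(np.std(v)), 4))
    for name, v in fold_rmsles.items()
}
```

The recursive model's much larger SD (≈0.28 vs ≈0.12) is driven almost entirely by its first-fold outlier.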
Since this reliability is largely driven by the much stronger performance on the first fold, and the simple model beats the recursive one only narrowly on folds 3 and 5, we perform one final check comparing the two models on the full dataset before declaring the simple model the final choice and tuning its hyperparameters on the full dataset.
X_full_all_features = train[feature_cols]
y_full = train[["sales"]]
ts_cv_full = list(block_tscv_gen(X_full_all_features, num_series, 14, 5))
full_simple_all_features_pipe = clone(simple_lgbm_pipe)
full_simple_all_features_pipe.set_params(
simple_lgbm__transformer__groupstandardscaler__num_groups=num_series
)
full_simple_all_features_pipe
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('cat',
OrdinalEncoder(dtype=<class 'int'>,
handle_unknown='use_encoded_value',
unknown_value=-1),
['day', 'month',
'is_new_year', 'local_type',
'regional_type',
'national_type',
'is_leap_year', 'store_nbr',
'family', 'city', 'type',
'cluster']),
('fourier',
FourierTransformer(periods_har...
FunctionTransformer(feature_names_out='one-to-one',
func=<function cast2float32 at 0x3b427fbe0>)),
('functiontransformer-2',
FunctionTransformer(check_inverse=False,
feature_names_out=<function log1p_feature_names at 0x36ca09120>,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>)),
('groupstandardscaler',
GroupStandardScaler(inverse_sorted_by_group=True,
num_groups=1782))])))])
full_simple_all_features_rmsles = -cross_val_score(
full_simple_all_features_pipe, X_full_all_features, y_full,
scoring="neg_root_mean_squared_log_error",
cv=ts_cv_full,
params={"simple_lgbm__categorical_feature": cat_cols}
)
def print_cv_results(labels):
width = max(len(label) for label in labels.keys())
print("\n".join(
f"{name:<{width}}: " + ", ".join(f"{v:.4f}" for v in vals)
for name, vals in labels.items()
))
simple_labels = {"Simple LGBM on Full Dataset": full_simple_all_features_rmsles,
"Simple LGBM on Sample": lgbm_rmsles}
print_cv_results(simple_labels)
Simple LGBM on Full Dataset: 0.8582, 0.6068, 0.5970, 0.4812, 0.4947
Simple LGBM on Sample      : 0.7913, 0.5082, 0.5533, 0.5274, 0.4438
As expected, training on the full dataset leads to different results, and the balanced sample seems to contain series for which the regime shifts in the first fold (post-Christmas) and third fold (around Easter) are easier to predict. Nevertheless, by increasing the complexity of the model to account for the richer data, we may be able to close these gaps.
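The fold-wise gap just described can be made explicit; positive values mean the full dataset scored worse (higher RMSLE) than the balanced sample:

```python
import numpy as np

# Per-fold difference between full-data and sample CV scores from the printout.
full_scores   = np.array([0.8582, 0.6068, 0.5970, 0.4812, 0.4947])
sample_scores = np.array([0.7913, 0.5082, 0.5533, 0.5274, 0.4438])
gap = np.round(full_scores - sample_scores, 4)  # RMSLE is lower-is-better
```

Only fold 4 improves on the full dataset; all other folds regress.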
Before comparing cross-validation results, we also evaluate the recursive model's performance on the full dataset and build the pipeline for the recursive model as well:
full_rec_all_features_pipe = clone(recursive_lgbm_pipe)
full_rec_all_features_pipe.set_params(
pre__lagged_features__pipeline__groupstandardscaler__num_groups=num_series,
pre__lagged_features__laggedfeaturestransformer__num_series=num_series,
recursive_lgbm__transformer__groupstandardscaler__num_groups=num_series,
recursive_lgbm__regressor__num_series=num_series
)
full_rec_all_features_pipe
Pipeline(steps=[('pre', ColumnTransformer(...)), ('recursive_lgbm', ...)])
[Condensed sklearn HTML representation: 'pre' builds the lagged features (cast to float32, log1p transform, GroupStandardScaler(num_groups=1782), LaggedFeaturesTransformer(num_series=1782, horizon=14, plus lags, seasonal rolling windows, and EWMA spans)), ordinal-encodes the categorical columns, adds Fourier features for 'weekday' and 'day_of_year', and clips the holiday-distance columns; 'recursive_lgbm' wraps RecursiveRegressor(base_estimator=LGBMRegressor(force_row_wise=True, random_state=42, verbosity=0), num_series=1782) in the same log1p + per-series scaling target transform.]
full_rec_all_features_rmsles = -cross_val_score(
full_rec_all_features_pipe, X_full_all_features, y_full,
scoring="neg_root_mean_squared_log_error",
cv=ts_cv_full,
params={"recursive_lgbm__categorical_feature": cat_idx}
)
recursive_labels = {"Recursive LGBM on Full Dataset": full_rec_all_features_rmsles,
"Recursive LGBM on Sample": recursive_lgbm_rmsles}
print_cv_results(recursive_labels)
Recursive LGBM on Full Dataset: 1.1550, 0.5085, 0.5811, 0.4297, 0.4446
Recursive LGBM on Sample      : 1.1548, 0.4870, 0.5616, 0.4092, 0.4535
The recursive model also has more problems with the third fold and generally performs similarly on the full dataset. However, the first fold remains catastrophic.
Maybe we can mitigate this by using a small ensemble of the simple and recursive model, averaging the forecasts of both. For clear evaluation, we also need the baselines for the full dataset and show all relevant information in a plot and in a table:
ensemble_rmsles = []
for train_idx, val_idx in ts_cv_full:
    simple = clone(full_simple_all_features_pipe)
    simple.fit(X_full_all_features.iloc[train_idx], y_full.iloc[train_idx],
               simple_lgbm__categorical_feature=cat_cols)
    y_pred_simple = simple.predict(X_full_all_features.iloc[val_idx])
    rec = clone(full_rec_all_features_pipe)
    rec.fit(X_full_all_features.iloc[train_idx], y_full.iloc[train_idx],
            recursive_lgbm__categorical_feature=cat_idx)
    y_pred_rec = rec.predict(X_full_all_features.iloc[val_idx])
    # average both forecasts with equal weights
    y_pred_ensemble = 0.5 * y_pred_simple + 0.5 * y_pred_rec
    ensemble_rmsles.append(root_mean_squared_log_error(y_full.iloc[val_idx], y_pred_ensemble))
ensemble_rmsles = np.array(ensemble_rmsles)
full_cv_14days_baseline = cv_lag14_baselines(y_full, ts_cv_full)
full_cv_last_year_baseline = cv_last_year_baselines(X_full_all_features, y_full, ts_cv_full)
labels = {"Last 14 days baseline": full_cv_14days_baseline,
"Last year baseline": full_cv_last_year_baseline,
"Simple LGBM": full_simple_all_features_rmsles,
"Recursive LGBM": full_rec_all_features_rmsles,
"Ensemble": ensemble_rmsles}
# plot
sns.lineplot(
pd.DataFrame(labels)
.assign(Fold=range(1, 6))
.melt(id_vars="Fold", var_name="Metric", value_name="RMSLE"),
x="Fold", y="RMSLE", hue="Metric", style="Metric", dashes=True, markers=True);
plt.xticks(ticks=range(1, 6), labels=range(1, 6));
plt.grid(visible=False, axis="x");
plt.ylim(0.4, 2);
print_cv_results(labels)
Last 14 days baseline: 1.8825, 0.6520, 0.7466, 0.5330, 0.5386
Last year baseline   : 0.6925, 0.7542, 0.8434, 0.9435, 0.9728
Simple LGBM          : 0.8582, 0.6068, 0.5970, 0.4812, 0.4947
Recursive LGBM       : 1.1550, 0.5085, 0.5811, 0.4297, 0.4446
Ensemble             : 1.0383, 0.5073, 0.5755, 0.4258, 0.4345
All models, including the ensemble, are unable to beat the last-year baseline on the first fold, but they do beat both baselines on the remaining folds. While the ensemble benefits strongly from the recursive model's better performance on these last four folds, the simple model unfortunately does not pull the ensemble's first-fold score closer to its own value. We could try to address this by tuning the ensemble (or the recursive model) exclusively on the first fold and then using a similar parameter space when optimizing for the mean across all folds. However, it remains unrealistic that these models will get much closer to the baseline, since the gap in the first fold is still very large.
For now, these models remain too unreliable, and while it is tempting to tune only on the last fold to achieve better results on the test set, we proceed to tune the much more reliable simple model on the mean across all folds in order to preserve this robustness. We do this with Optuna: its Bayesian optimization is far more sample-efficient than scikit-learn's grid or random search, and Optuna Dashboard allows for better investigation of the optimization trials.
Before doing that, it is worth noting how fast the recursive model can generate predictions for nearly 25,000 rows (1,782 series × 14 days), thanks to its fully vectorized implementation:
import timeit
# take last fold's validation set
idx_val = ts_cv_full[-1][1] # example for realistic forecast size
full_rec_all = clone(full_rec_all_features_pipe)
full_rec_all.fit(X_full_all_features, y_full, recursive_lgbm__categorical_feature=cat_idx)
def predict_code():
    full_rec_all.predict(X_full_all_features.iloc[idx_val])
secs = timeit.timeit(predict_code, number=100)
print(f"Total: {secs:.3f}s | per run: {secs/100:.6f}s")
Total: 90.488s | per run: 0.904882s
At under one second per forecast, this is effectively real-time prediction. Even if this is not necessary for daily forecasts, it shows that in applications where fast predictions are mandatory and no harsh regime shifts occur, this type of model could be highly advantageous.
We now begin the hyperparameter tuning using Optuna.
import optuna
stem = "simple_lgbm__regressor__"
def objective(trial):
    max_depth = trial.suggest_int("max_depth", 3, 64)
    # prevent optuna from creating structurally infeasible trees:
    # suggesting more leaves than a tree of this depth can hold would make the search space redundant
    max_num_leaves = min(2**max_depth, 255)  # max num leaves at given depth is 2**depth
    min_num_leaves = min(max_num_leaves - 1, 31)  # when depth is small, trees should be almost full
    params = {
        # complexity
        "num_leaves": trial.suggest_int("num_leaves", min_num_leaves, max_num_leaves),
        "max_depth": max_depth,
        "min_child_samples": trial.suggest_int("min_child_samples", 100, 1000),  # min_data_in_leaf
        # learning dynamics
        "learning_rate": trial.suggest_float("learning_rate", 0.03, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),  # 100 already worked quite well
        # sampling/regularization
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.7, 1.0),  # column sample ratio
        "subsample": trial.suggest_float("subsample", 0.7, 1.0),  # row sample ratio
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 1, log=True),  # l1 reg
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 1, log=True),  # l2 reg
        "min_split_gain": trial.suggest_float("min_split_gain", 0.0, 0.2),
        # parallelization
        "n_jobs": 4,
        # "early_stopping_round": 50,  # early stopping would require fitting on X_tr and transforming X_val separately for each fold
        "verbosity": -1
    }
    full_params = {stem + k: v for k, v in params.items()}
    model = clone(full_simple_all_features_pipe)
    model.set_params(**full_params)
    rmsles = -cross_val_score(
        model,
        X_full_all_features, y_full,
        scoring="neg_root_mean_squared_log_error",
        cv=ts_cv_full,
        params={"simple_lgbm__categorical_feature": cat_cols},
    )
    trial.set_user_attr("fold_rmsles", rmsles.tolist())  # list for JSON compatibility
    # report per-fold scores for Optuna Dashboard
    for i, score in enumerate(rmsles):
        trial.report(score, step=i)
    return np.mean(rmsles)
With the optimization function defined, we are able to conduct the optimization now:
import optuna
# Add stream handler of stdout to show the messages
optuna.logging.set_verbosity(optuna.logging.WARNING)
study_name = "simple_lgbm_all"
storage_name = f"sqlite:///{study_name}.db"
study = optuna.create_study(
study_name=study_name,
storage=storage_name,
load_if_exists=True
)
study.optimize(objective, n_trials=80)
We load the created study, extract the best parameters and save the resulting model:
loaded_study = optuna.load_study(study_name=study_name, storage=storage_name)
best_params = {stem + key: value for key, value in loaded_study.best_params.items()}
best_simple_model = clone(full_simple_all_features_pipe)
best_simple_model.set_params(**best_params)
Pipeline(steps=[('pre', ColumnTransformer(...)), ('simple_lgbm', TransformedTargetRegressor(...))])
[Condensed sklearn HTML representation: preprocessing as before; the tuned regressor is]
LGBMRegressor(colsample_bytree=0.9980277795300263, force_row_wise=True,
              learning_rate=0.09437257142042779, max_depth=20,
              min_child_samples=663, min_split_gain=0.0719307789840761,
              n_estimators=261, num_leaves=179, random_state=42,
              reg_alpha=0.002311539997189158, reg_lambda=0.21278967796995524,
              subsample=0.8688943601693476, verbosity=0)
[wrapped in the log1p + GroupStandardScaler(num_groups=1782) target transform.]
Let us compare the tuned model with the baselines and the non-tuned version:
best_rmsles = -cross_val_score(
best_simple_model, X_full_all_features, y_full,
scoring="neg_root_mean_squared_log_error",
cv=ts_cv_full,
params={"simple_lgbm__categorical_feature": cat_cols}
)
#best_rmsles = np.array(study.best_trial.user_attrs["fold_rmsles"])
best_labels = {"Last 14 days baseline": full_cv_14days_baseline,
"Last year baseline": full_cv_last_year_baseline,
"Simple LGBM (non-tuned)": full_simple_all_features_rmsles,
"Simple LGBM (tuned)": best_rmsles}
# plot
sns.lineplot(
pd.DataFrame(best_labels)
.assign(Fold=range(1, 6))
.melt(id_vars="Fold", var_name="Metric", value_name="RMSLE"),
x="Fold", y="RMSLE", hue="Metric", style="Metric", dashes=True, markers=True);
plt.xticks(ticks=range(1, 6), labels=range(1, 6));
plt.grid(visible=False, axis="x");
plt.ylim(0.4, 2);
print_cv_results(best_labels)
Last 14 days baseline  : 1.8825, 0.6520, 0.7466, 0.5330, 0.5386
Last year baseline     : 0.6925, 0.7542, 0.8434, 0.9435, 0.9728
Simple LGBM (non-tuned): 0.8582, 0.6068, 0.5970, 0.4812, 0.4947
Simple LGBM (tuned)    : 0.6839, 0.4681, 0.5668, 0.4275, 0.4200
The hyperparameter tuning especially improved the first, second, and last folds; the tuned model now narrowly beats the last-year baseline in the first fold. According to the Optuna Dashboard, only three trials performed negligibly better on the first fold (the best with a value of 0.6710), which highlights how hard it is to capture the regime shift there and suggests that even the best overall configuration can only narrowly close the first-fold gap. Overall, the tuning improved the mean performance by 15.5%. I tried to improve the first fold further with special features for the first 14 days of January (an additional flag, or a feature that equals the day of month during this period and zero otherwise), but this resulted in worse forecasts for that fold, likely because the extra tree splits introduced noise rather than meaningful signal.
It is worth noting here that this is a common problem in retail/sales forecasting. The regime shift between years is typically the strongest, since many changes occur simultaneously, such as the abrupt transition from the high-volume pre-Christmas sales to low-sales January weeks, but also structural changes like shop openings and closures, the removal of Christmas products, and the introduction of new ones.
One possible solution is to treat January differently, for example by using the last-year baseline as a predictor for January only. This would make the recursive model much more attractive, since it is reliable on all later folds. However, it makes the prediction pipeline more complex. The simple model, which performs almost as well as this baseline in this fold, has the advantage that such a logical split is not required, albeit at the cost of slightly lower performance at other times of the year.
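Such a seasonal switch could be sketched as follows. This is a minimal illustration, not part of the actual pipeline; the `hybrid_forecast` helper is hypothetical and assumes the model forecast and the last-year baseline are already aligned on the same dates:

```python
import numpy as np
import pandas as pd

def hybrid_forecast(dates, model_preds, last_year_values):
    """Use the last-year baseline for January dates and the
    model forecast for the rest of the year (hypothetical helper)."""
    dates = pd.DatetimeIndex(dates)
    is_january = dates.month == 1
    return np.where(is_january, last_year_values, model_preds)

# toy usage: the two January days fall back to the baseline
dates = ["2017-12-31", "2018-01-01", "2018-01-02"]
preds = np.array([10.0, 12.0, 11.0])
baseline = np.array([9.0, 4.0, 5.0])
print(hybrid_forecast(dates, preds, baseline))  # [10.  4.  5.]
```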
For now, we accept January as a weak spot and proceed with the simple model, analyzing its errors on the last fold, since it is closest to the test window and at least folds 2 and 4 behave similarly.
last_fold_model = clone(best_simple_model)
last_fold = ts_cv_full[-1]
last_fold_model.fit(X_full_all_features.iloc[last_fold[0]],
y_full.iloc[last_fold[0]],
simple_lgbm__categorical_feature=cat_cols)
y_pred_train = last_fold_model.predict(X_full_all_features.iloc[last_fold[0]]).reshape(num_series, -1)
y_pred_val = last_fold_model.predict(X_full_all_features.iloc[last_fold[1]]).reshape(num_series, -1)
train["predictions"] = np.concatenate([y_pred_train, y_pred_val], axis=1).flatten()
val_start_last_fold = train.iloc[last_fold[1][0]].date
train["set"] = np.where(train["date"] < val_start_last_fold, "In-Sample", "Out-Of-Sample")
train[["store_nbr", "family", "date", "sales", "predictions", "set"]]
| store_nbr | family | date | sales | predictions | set | |
|---|---|---|---|---|---|---|
| 0 | 1 | AUTOMOTIVE | 2013-01-01 | 0.000 | -0.385117 | In-Sample |
| 1 | 1 | AUTOMOTIVE | 2013-01-02 | 2.000 | 1.823995 | In-Sample |
| 2 | 1 | AUTOMOTIVE | 2013-01-03 | 3.000 | 1.589873 | In-Sample |
| 3 | 1 | AUTOMOTIVE | 2013-01-04 | 3.000 | 1.767063 | In-Sample |
| 4 | 1 | AUTOMOTIVE | 2013-01-05 | 5.000 | 1.820222 | In-Sample |
| ... | ... | ... | ... | ... | ... | ... |
| 3007997 | 9 | SEAFOOD | 2017-07-28 | 11.000 | 10.009208 | Out-Of-Sample |
| 3007998 | 9 | SEAFOOD | 2017-07-29 | 21.916 | 22.168102 | Out-Of-Sample |
| 3007999 | 9 | SEAFOOD | 2017-07-30 | 19.909 | 18.688703 | Out-Of-Sample |
| 3008000 | 9 | SEAFOOD | 2017-07-31 | 12.000 | 14.054185 | Out-Of-Sample |
| 3008001 | 9 | SEAFOOD | 2017-08-01 | 19.316 | 18.527180 | Out-Of-Sample |
2983068 rows × 6 columns
evaluation_start = val_start_last_fold + pd.DateOffset(-56)
eval_df = train.loc[
(train.cluster_rep>=0) & (train.date>=evaluation_start),
["date", "sales", "predictions", "set", "cluster_rep"]
]
date_range = pd.date_range(eval_df.date.min(), eval_df.date.max())
g = sns.relplot(eval_df,
x="date", y="sales", col="cluster_rep", col_wrap=2, kind="line",
height=2.5, aspect=3, facet_kws={"sharey": False}, label="True Value");
for ax in g.axes.flatten():
    facet_cluster = float(ax.title.get_text()[-3:])
    eval_pred = eval_df[eval_df.cluster_rep == facet_cluster]
    ax.plot("date", "predictions", data=eval_pred[eval_pred.set == "In-Sample"],
            label="In-Sample Prediction", color="darkgrey")
    ax.plot("date", "predictions", data=eval_pred[eval_pred.set == "Out-Of-Sample"],
            label="Out-Of-Sample Prediction", color="orange")
    ax.set_xlabel("Date")
    ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MO))
    ax.tick_params("x", labelrotation=45)
    ymin = np.floor(np.min(eval_pred[["sales", "predictions"]]) - 1)
    ymax = np.ceil(np.max(eval_pred[["sales", "predictions"]]) + 1)
    ax.set_ylim(ymin, ymax)
    ax.set_ylabel("Sales")
    ax.set_title(f"Representative of Cluster {int(facet_cluster)}")
    #ax.vlines(x=val_start_last_fold, ymin=ymin, ymax=ymax, color="darkgrey")
    #ax.annotate("Start Out-of-Sample", xy=(val_start_last_fold + pd.DateOffset(hours=6), ymax*0.95), color="grey")
handles, labels = ax.get_legend_handles_labels()
g.fig.legend(handles, labels,
loc='upper center',
bbox_to_anchor=(0.5, 1.04), # adjust upward offset
ncol=4, # number of columns for legend items
frameon=False)
# add super title
g.fig.suptitle("Predictions and Forecasts of Shape Cluster Representatives", size=16, y=1.08);
plt.subplots_adjust(wspace=0.1)
Series with stronger weekly seasonal patterns, such as the representatives of clusters 1, 3, 5, and especially 0, show better predictions than those of clusters 2 and 4, where the values in the second forecast week look unexpectedly high. In general, the model seems to have problems predicting peaks, which is at least partly due to the chosen metric, RMSLE, which penalizes relative deviations more strongly than absolute ones.
Let us now look at the overall values per shape cluster:
cluster_pattern = re.compile(r"^(membership|is_constant_zero)")
clusters_85 = np.argmax(train[[col for col in train if cluster_pattern.search(col)]], axis=1).reshape(-1, 1)
clusters_85[clusters_85 == 6] = -1 # assign -1 to constant zero series
cluster_preds = np.concatenate([clusters_85, train[["sales", "predictions", "set"]]], axis=1)
cluster_preds = cluster_preds[cluster_preds[:, -1] == "Out-Of-Sample"]  # keep only out-of-sample rows
rmsles_per_cluster = {}
for cluster in np.arange(-1, 6):
    cluster_vals = cluster_preds[cluster_preds[:, 0] == cluster][:, [1, 2]]
    rmsles_per_cluster[cluster.item()] = root_mean_squared_log_error(cluster_vals[:, 0], cluster_vals[:, 1])
sns.barplot(rmsles_per_cluster);
plt.xlabel("Shape Cluster (-1 = zero-only series)"); plt.ylabel("RMSLE");
plt.ylim(None, 0.5);
plt.title("RMSLE values per Cluster");
There is some variance across the shape clusters, and the model handles the constant-zero series very well. Interestingly, cluster 4, which contains mostly alcoholic beverages, shows the lowest RMSLE among the non-zero clusters. We go one step deeper and examine the values for each individual series:
# calculating the values
vals = (train
.sort_values(["store_nbr", "family", "date"])
.loc[train.set == "Out-Of-Sample", ["sales", "predictions"]]
.to_numpy())
vals = vals.reshape(num_series, 14, 2).transpose(2, 1, 0) # -> (true/pred, 14 days, num_series series)
rmsle_per_series = root_mean_squared_log_error(
vals[0], vals[1], multioutput="raw_values"
).reshape(num_stores, num_families).T
# making the plot
sns.histplot(rmsle_per_series.flatten(), binwidth=0.05);
xtickslabels = np.round(np.arange(0, 2.8, 0.2), 1)
ytickslabels = np.arange(0, 240, 20)
plt.xticks(ticks=xtickslabels, labels=xtickslabels);
plt.xlabel("RMSLE")
plt.yticks(ytickslabels, ytickslabels);
plt.title("RMSLE Values of Series");
Most series show RMSLE values below the overall value of 0.42 (note that RMSLE does not scale linearly). Although there is one extreme outlier, only a few series exceed a value of 1.0; these few series can be regarded as particularly hard to model.
plt.figure(figsize=(16, 8))
sns.heatmap(
rmsle_per_series,
vmax=1, # for better visual distinction
cmap="rocket",
cbar=True,
linewidths=0.5,
square=True,
linecolor="grey",
);
plt.xticks(ticks=np.arange(num_stores) + 0.5, labels=stores_cat.categories, rotation=90); # stores sorted numerically
plt.yticks(ticks=np.arange(num_families) + 0.5, labels=families, rotation=0);
plt.tick_params(axis="x", labeltop=True)
plt.title("RMSLE values by Store Number and Product Family", size=14, y=1.08);
Each square represents one series. We can clearly see more horizontal than vertical "stripes", indicating that the product family largely determines how hard a series is to forecast. However, store 52 shows high values in multiple product families. This makes sense, since it contains non-zero values only near the end of the training window, suggesting that it was opened around that time.
We could try to improve the performance by fitting a separate model for each product family or, to keep things more compact, for each shape cluster. Nevertheless, this would also increase the pipeline's complexity and maintenance effort in production. Alternatively, we could introduce per-series weights, for example by using each series' coefficient of variation as a proxy for its predictability or instability, and cap these weights to prevent extreme series from dominating the optimization.
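A minimal sketch of such capped weights, under the assumption that unstable series (high coefficient of variation) should be downweighted via the inverse CV; the data and caps here are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# toy data: 5 series with 100 observations each (hypothetical)
df = pd.DataFrame({
    "series_id": np.repeat(np.arange(5), 100),
    "sales": rng.gamma(2.0, 5.0, size=500),
})

stats = df.groupby("series_id")["sales"].agg(["mean", "std"])
cv = stats["std"] / stats["mean"].clip(lower=1e-9)      # coefficient of variation per series
weights = (1.0 / cv.clip(lower=0.1)).clip(upper=3.0)    # invert and cap to limit dominance
sample_weight = df["series_id"].map(weights).to_numpy()
# sample_weight could then be passed to the regressor's fit(..., sample_weight=...)
```

The cap values would need to be tuned; without them, near-constant series would receive huge weights and dominate the loss.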
shape_clustering = make_pipeline(
GroupStandardScaler(num_groups=num_series, inverse_sorted_by_group=True),
ShapeClusteringTransformer(num_series=num_series, num_clusters=n_clusters, fit_window_frac=1.0, use_soft=True)
)
pre = ColumnTransformer(
[("cat", ordinal_encoder, cat_cols), # all categorical cols including day of month and month
("fourier", fourier_transformer, fourier_cols), # weekday, day_of_year
("reals", to_float32, selector_no_sales), # all floats but sales
("dist_clip", clip_transformer, dist_holiday_cols),
("shapes", shape_clustering, ["sales"])
],
remainder="passthrough",
verbose_feature_names_out=False,
)
target_pipe = make_pipeline(
to_float32,
log1p_transformer,
GroupStandardScaler(num_groups=num_series, inverse_sorted_by_group=True)
)
simple_lgbm = TransformedTargetRegressor(
LGBMRegressor(random_state=42,
force_row_wise=True,
verbosity=0),
transformer=target_pipe,
check_inverse=False #essential: avoid subset transform that breaks group-size assumption
)
final_lgbm = Pipeline([
("pre", pre),
("simple_lgbm", simple_lgbm)
])
final_lgbm.set_params(**best_params)
# Remove shape cluster memberships to fit clusters on the full training window later
final_cols = [feature for feature in feature_cols if not cluster_pattern.search(feature)]
final_lgbm.fit(train[final_cols], y_full, simple_lgbm__categorical_feature=cat_cols)
Pipeline(steps=[('pre', ColumnTransformer(...)), ('simple_lgbm', TransformedTargetRegressor(...))])
[Condensed sklearn HTML representation: the same tuned LGBMRegressor and target transform as above, with one addition in 'pre': a 'shapes' branch that applies GroupStandardScaler(num_groups=1782) followed by ShapeClusteringTransformer(num_series=1782, num_clusters=6, use_soft=True, fit_window_frac=1.0, tau=1.0) to the 'sales' column.]
# In-sample performance of the final tuned model
pred_train = final_lgbm.predict(train[final_cols])
get_metrics(y_full, pred_train)
Negative Predictions detected with minimum -1.00 and clipped to 0
{'RMSLE': 0.3818, 'MAE': 46.0785}
# Out-of-sample performance on the held-out test period
pred_test = final_lgbm.predict(test[final_cols])
get_metrics(test["sales"], pred_test)
Negative Predictions detected with minimum -0.13 and clipped to 0
{'RMSLE': 0.4279, 'MAE': 79.2172}
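The `get_metrics` helper used above is defined earlier in the notebook; as a point of reference, a minimal re-implementation consistent with the printed output (clipping negative predictions to zero, then reporting RMSLE and MAE) might look like this. The exact implementation is an assumption:

```python
import numpy as np

def get_metrics(y_true, y_pred):
    """Hypothetical sketch of the metric helper: clip negative predictions
    to 0, then return RMSLE and MAE rounded to four decimals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_pred < 0).any():
        print(f"Negative Predictions detected with minimum "
              f"{y_pred.min():.2f} and clipped to 0")
        y_pred = np.clip(y_pred, 0, None)
    rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
    mae = np.mean(np.abs(y_true - y_pred))
    return {"RMSLE": round(float(rmsle), 4), "MAE": round(float(mae), 4)}
```

RMSLE is computed on `log1p`-transformed values, which is why clipping negatives is necessary before taking the logarithm.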
The RMSLE of the tuned model on the last cross-validation fold (0.42) closely matches the test RMSLE of 0.428, indicating that our cross-validation scheme is reliable. The training error (0.38) is slightly lower, pointing to mild but expected overfitting. While the relative error remains similar, the mean absolute error increases on the test period, reflecting higher-volume observations and increased volatility in some series.
In production, we could now save the model and run it daily to forecast the next 14 days of sales for all stores at once.
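A sketch of that production step, with illustrative file names and a stub of the daily job (the persistence path and function names are assumptions, not part of the notebook):

```python
import pickle

# Persist the fitted pipeline once after training (path is illustrative):
# with open("final_lgbm.pkl", "wb") as f:
#     pickle.dump(final_lgbm, f)

def daily_forecast(model, future_features):
    """Illustrative daily job: predict the next 14 days for all stores at
    once. `future_features` is assumed to already contain the engineered
    columns (final_cols) for the 14 future dates."""
    preds = model.predict(future_features)
    return preds.clip(min=0)  # sales cannot be negative
```

In a real deployment, the feature-engineering pipeline would run first to build `future_features` for the forecast horizon before calling `daily_forecast`.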
Of course, as already mentioned in the relevant sections, there are still several possibilities one could explore to further improve the forecasts. I summarize them in the following overview.
What Else Could Be Done ↑¶
Data-level improvements
- Research holidays relevant to minority groups in Ecuador: These could influence sales patterns (for example, Muslim holidays).
- Add local, regional, and national holiday types to all days where the distance to a holiday is encoded: Combined with `dist_any_holiday`, this would allow the model to distinguish whether a given day is, for example, two days before a local or a national holiday.
However, it is questionable whether these changes would lead to substantial performance gains.
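Still, the holiday-type idea above could be prototyped cheaply. The helper below is a hypothetical sketch (not the notebook's actual feature code): for each day it returns the signed distance to the nearest holiday together with that holiday's type.

```python
import numpy as np

def nearest_holiday_features(dates, holidays):
    """Hypothetical sketch: for each ordinal date, return the signed distance
    (holiday - day) to the nearest holiday and that holiday's type.
    `holidays` maps ordinal date -> type string (e.g. 'local', 'national')."""
    hol_days = np.array(sorted(holidays))
    feats = []
    for d in dates:
        dist = hol_days - d                 # negative: holiday already passed
        i = np.argmin(np.abs(dist))         # index of nearest holiday
        feats.append((int(dist[i]), holidays[hol_days[i]]))
    return feats
```

Joining such a table onto the calendar would give the model both "how far" and "what kind" in a single lookup.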
Model input and preprocessing
- Add an outlier detection strategy, such as residual-based filtering or clipping: One approach would be to first train a baseline model (for example, LightGBM), predict on the full training set, and clip true values with very large absolute residuals. This method would preserve meaningful event- or holiday-driven spikes already captured by the model, whereas naïve cutoffs (for example, > 3 σ) might incorrectly remove relevant signals. That said, this effectively introduces a second-order forecasting step and increases pipeline complexity.
- Ensure all pipeline steps generalize to new time series: For full production readiness, the pipeline would need to handle unseen series robustly, for example by applying fixed scaling parameters, ensuring stable ordering for PCA-based clustering, and enabling cluster assignment for new series based on their identifiers. For a portfolio project with a fixed set of series, this would likely be overengineered and unnecessarily demanding for the reader.
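The residual-based clipping idea from the first bullet above can be sketched in a few lines. This is a simplified illustration (the quantile threshold and function name are assumptions): targets are pulled toward the baseline forecast only where the baseline's residual is extreme, so spikes the baseline already explains stay untouched.

```python
import numpy as np

def clip_by_residuals(y_true, y_baseline_pred, quantile=0.999):
    """Cap targets whose absolute residual against a baseline forecast is
    extreme; event-driven spikes the baseline captures produce small
    residuals and are left unchanged."""
    y_true = np.asarray(y_true, dtype=float)
    base = np.asarray(y_baseline_pred, dtype=float)
    resid = y_true - base
    cutoff = np.quantile(np.abs(resid), quantile)
    return np.clip(resid, -cutoff, cutoff) + base
```

The clipped targets would then replace `y` only for training; evaluation would still use the original sales.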
Training and modeling strategies
- Fit separate models for each product family: This could improve accuracy but would significantly increase pipeline complexity. As a lighter alternative, product families could be weighted differently during training, although this would require experimentation to identify suitable weighting schemes.
- Experiment with different ensemble strategies: For example, combining the simple LightGBM model with linear models for extrapolation-heavy scenarios. Multi-level (hierarchical) linear regression models are particularly interesting, as they allow series-specific intercepts and coefficients while still learning from all series jointly. One possible approach would be to fit such a model with random intercepts (and possibly a small number of random slopes), then train a LightGBM model on the residuals to capture remaining non-linear structure. However, fitting crossed random-effects models on nearly three million rows is computationally and numerically challenging, and such models are not available in scikit-learn.
- Create a forked prediction pipeline: For example, applying the last-year baseline or the simple LightGBM model during the year-to-year regime shift in January, and using the recursive LightGBM model otherwise. This would also increase the pipeline's complexity.
- Train a Temporal Fusion Transformer on the full dataset with explicit regime-shift markers: Such markers could help the model’s attention and variable selection mechanisms adapt locally within the series. This approach is time- and resource-intensive, would likely require cloud training, and comes with uncertain performance gains.
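The two-stage "random intercepts plus residual boosting" idea from the ensemble bullet above can be approximated very cheaply: use per-series means in log space as a stand-in for random intercepts, then train a boosted model on what remains. The sketch below only shows the two stages' data flow; names are illustrative, and a real implementation would use a mixed-effects package for stage 1 and LightGBM for stage 2.

```python
import numpy as np

def fit_series_intercepts(series_ids, y):
    """Stage 1 stand-in for random intercepts: per-series mean of log1p(sales)."""
    logy = np.log1p(np.asarray(y, dtype=float))
    ids = np.asarray(series_ids)
    return {s: logy[ids == s].mean() for s in np.unique(ids)}

def residual_targets(series_ids, y, intercepts):
    """Stage 2 targets: the part left for a boosted model to explain."""
    logy = np.log1p(np.asarray(y, dtype=float))
    base = np.array([intercepts[s] for s in series_ids])
    return logy - base
```

A final prediction would add the stage-1 intercept back to the stage-2 forecast and invert the `log1p` transform.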
The Dataset from a Sociological and Methodological Perspective ↑¶
As we have seen, the Favorita dataset reflects not only sales dynamics but also organizational and social rhythms, such as weekly and yearly seasonal differences in product preferences, holiday-driven consumption, store openings and closures, and product life cycles. What may appear as a simple signal at first sight is, in fact, a socio-historical artefact, granting insights into Ecuadorian society, with a particular focus on the middle class during the mid-2010s.
Many of the observed regime shifts are therefore not modeling failures but expressions of structural change in the inherently social retail system itself. From a methodological perspective, this highlights the limits of purely data-driven approaches when structural breaks dominate the signal. In such settings, robustness and interpretability become as important as marginal gains in accuracy.
Why Time Series Forecasting Is Fascinating ↑¶
Data scientists and machine learning engineers often prefer other modalities, such as computer vision or natural language processing, over time series forecasting. The reasons are understandable: images and texts already represent highly structured data where signals are usually strong (otherwise their information content would not be obvious to humans), and appropriate models often perform convincingly. Moreover, the data and model outputs are often more spectacular: images and texts can be visualized or even generated. Time series data, in contrast, is often unstructured, shaped by multiple overlapping effects, trends, seasonalities, and regime shifts, which frequently requires substantial effort and statistical knowledge to achieve only incremental performance gains. Designing pipelines that avoid future leakage and constructing appropriate cross-validation schemes add further complexity. On top of that, both the data and the predictions are often reduced to simple lines, and when dealing with thousands of time series simultaneously, it becomes unclear which series should even be plotted.
However, while creating this notebook, I realized that time series can reveal where reality has left its fingerprints, since their temporal structure allows us to trace patterns that are at least suggestive of causality. Even if these signals are often buried beneath other effects, they are still present. Finding structure in the data and making it usable for models is therefore not only tedious but also insightful and inspiring. For example, observing slightly lower values around Good Friday, combined with a brief background search, revealed that these days are celebrated more religiously in Ecuador, leading to reduced consumption compared to Western Europe.
While we cannot forecast the truly unexpected, I still find the outcome of time series forecasting deeply valuable, as it comes close to what humans have attempted for thousands of years: predicting the future. In ancient China, entire writing systems were developed for this purpose. Bones were inscribed with early forms of Chinese characters and exposed to fire; the resulting cracks were interpreted as divine oracle messages — hence the name oracle bone script.
Today, we have replaced such rituals with time series models, but the task remains similar: uncovering the cracks that reality leaves in the data in order to make meaningful predictions.
Conclusions ↑¶
In this project, I explored and preprocessed Ecuadorian historical retail data and developed a preprocessing pipeline to predict sales for the next 14 days. To this end, I first compiled data from multiple files and derived sensible prioritization rules for Ecuadorian holidays directly from the data. Holiday handling also included building a global calendar table to enable correct feature extraction from both dates and holiday information. Since the data comprises thousands of time series, I clustered the series by their shape and used these results to gain deeper insights into the data — such as seasonalities, holiday effects, and singular events — and derived features from them without inspecting thousands of individual series. The clusters were also used as features themselves and to draw a balanced sample of series for model selection.
I then evaluated a simple LightGBM model, a fully vectorized custom recursive LightGBM model, and a Temporal Fusion Transformer on this sample. The results clearly showed that the LightGBM models perform substantially better on the most recent fold of the time-aware cross-validation. I subsequently trained both tree-based models on all series and selected the simple LightGBM model, as it proved to be the most reliable across all cross-validation folds, including the first fold, which represents the particularly hard-to-predict regime shift between years.
Hyperparameter tuning further improved this model, beating both baselines by 19% on average across all folds. An error analysis on the last fold revealed that forecast errors per series are centered around 0.3, with only a small number of series showing extreme errors. At the same time, the analysis showed that some product families are consistently harder to forecast than others. The final evaluation on the test set yielded performance very similar to that of the last cross-validation fold, with only mild and expected overfitting.
Future work could focus on targeted modeling for hard-to-predict product families or on incorporating domain-specific rules for structurally unstable periods such as January.